IN THE UNITED STATES PATENT AND TRADEMARK OFFICE



STATEMENT CLAIMING SMALL ENTITY STATUS 

(37 CFR § 1.9(f) & 1.27(c)) SMALL BUSINESS CONCERN



Patent or Application No. 




Filed Or Issued Date 


July 21, 2000 


Title 


FAULT-TOLERANCE FRAMEWORK FOR AN 
EXTENDABLE COMPUTER ARCHITECTURE 


Attorney Docket No. 


ATAE-01015US0 DEL 



I hereby state that I am (check [ ✓ ] one of the following):

[ ] the owner of the small business concern identified below:

[ ✓ ] an official empowered to act on behalf of the small business concern identified below:

NAME OF SMALL BUSINESS CONCERN: Market Engine Corporation 

ADDRESS OF SMALL BUSINESS CONCERN: 2140 Shattuck Avenue, Suite 710, Berkeley, CA 94704

I hereby state that the above identified small business concern qualifies as a small business concern 
as defined in 13 CFR §121 for purposes of paying reduced fees to the United States Patent and 
Trademark Office. 

Questions as to size standards that qualify as a small business concern may be directed 
to Small Business Administration, Size Standards Staff, 409 Third Street, SW, Washington, DC 
20416. 

Under 13 CFR § 121, size standards are expressed either by the number of employees or 
annual receipts in millions of dollars of the business concern, including those of its affiliates. 
Generally, a concern qualifies as a small business concern if the number of its, together with its 
affiliate's, employees does not exceed 500 persons and its, together with its affiliate's, annual 
revenue does not exceed $5 million. 

For purposes of small business concern qualification, (1) the number of employees is the 
average over the previous fiscal year of the persons employed on a full-time, part-time or 
temporary basis during each of the pay periods of the fiscal year of the business concern and its 
affiliates, and (2) business concerns are affiliates of each other when either, directly or indirectly, 
one concern controls or has the power to control the other, or a third-party or parties controls or 
has the power to control both. Further information concerning small business concern 
qualification is available at web addresses: http://www.sba.gov/regulations/121a.html and
http://www.sba.gov/regulations/121b.html.

I hereby state that rights under contract or law have been conveyed to and remain with the
small business concern identified above with regard to the invention described in (check [ ✓ ] one of
the following):

[ ✓ ] the specification filed herewith with title as listed above,

[ ] the application identified above,
[ ] the patent identified above.

If the rights held by the above-identified small business concern are not exclusive, each
individual, concern or organization having rights to the invention must file separate statements as to
their status as small entities, and no rights to the invention are held by any person, other than the
inventor, who would not qualify as an independent inventor under 37 CFR § 1.9(c) if that person
made the invention, or by any concern which would not qualify as a small business concern under 37
CFR § 1.9(d) or a nonprofit organization under 37 CFR § 1.9(e).

Each person, concern or organization having any rights in the invention is listed below
(check [ ✓ ] one of the following):

[ ✓ ] No such person, concern or organization exists.

[ ] Each such person, concern or organization is listed below.

1st Entity

NAME:

ADDRESS:

STATUS: [ ] Individual [ ] Small Business Concern [ ] Nonprofit Organization

2nd Entity

NAME: 

ADDRESS: 

STATUS: [ ] Individual [ ] Small Business Concern [ ] Nonprofit Organization 

I acknowledge the duty to file, in this application or patent, notification of any change in 
status resulting in loss of entitlement to small entity status prior to paying, or at the time of paying, 
the earliest of the issue fee or any maintenance fee due after the date on which status as a small 
business entity is no longer appropriate. (37 CFR § 1.28(b)).

NAME OF PERSON SIGNING: Behrooz Ataee

TITLE OF PERSON SIGNING IF OTHER THAN OWNER: President 

ADDRESS OF PERSON SIGNING: 2140 Shattuck Avenue, Suite 710, Berkeley, CA 94704
SIGNATURE: [signature] Date: July 20, 2000

TITLE

FAULT-TOLERANCE FRAMEWORK FOR AN EXTENDABLE 
COMPUTER ARCHITECTURE 



INVENTORS 

Drew Shaffer Roselli 
Rico (NMI) Blaser 
Mikel Carl Lechner 

CROSS-REFERENCE 

This application is a continuation-in-part of the application entitled MARKET ENGINES 
HAVING EXTENDABLE COMPONENT ARCHITECTURE, invented by Rico (NMI) Blaser; 
SC/Serial No. 09/360,899; Filing Date: Jan. 26, 2000. 

COPYRIGHT NOTICE 

A portion of the disclosure of this patent document contains material which is subject to 
copyright protection. The copyright owner has no objection to the facsimile reproduction by 
anyone of the patent document or the patent disclosure, as it appears in the Patent and Trademark 
Office patent file or records, but otherwise reserves all copyright rights whatsoever. 

BACKGROUND OF THE INVENTION 

The present invention relates to the field of electronic commerce (e-commerce) and par- 
ticularly to electronic systems in capital markets and other e-commerce applications with high 
availability and scalability requirements. 

Historically, mission critical applications have been written for and deployed on large
mainframes, typically with built-in (hardware) or low-level operating system (software)
fault-tolerance. In some prior art, such fault-tolerance mechanisms include schemes where multiple
central processing units (CPUs) redundantly compute each operation and the results are compared
using a vote (in the case of three-way or more redundancy) or other logical comparisons of the
redundant outcomes in order to detect and avoid failures. In some cases a fault-stop behavior is
implemented where it is preferred to stop and not execute a program operation when an error or
other undesired condition will result. This fault-stop operation helps to minimize the propagation
of errors to other parts of the system. In other implementations, elaborate fault recovery
mechanisms are implemented. These mechanisms typically only recover from hardware failures since
application failures tend to be specific to the particular application software. To detect errors in
application software, vast amounts of error-handling code have been required. Certain financial
applications have devoted as much as 90% of their code to error detection and correction. Because
of the enormous complexity of such software applications, it is nearly impossible to entirely
eliminate failures that prevent the attainment of reliable and continuous operation.

Increasingly, systems need to be available on a continuous basis, 24 hours per day, 7 days 
per week (24/7 operation). In such nonstop environments it is undesirable for a system to be un- 
available when system components are being replaced or software and hardware failures are de- 
tected. In addition, today's applications must scale to increasing user demands that in many 
cases exceed the processing capabilities of a single computer, regardless of size from small to 
mainframe. When the system load cannot be handled on a single machine, it has been difficult 
and costly to obtain a larger machine and move the application to the larger machine without 
downtime. Attempts to distribute work over two or more self-contained machines are often
difficult because the software typically has not been written to support distributed computations.

For these reasons, the need for computational clusters has increased. In computational 
clusters, multiple self-contained nodes are used to collaboratively run applications. Such appli- 
cations are specifically written to run on clusters from the outset and once written for clusters, 
applications can run on any configuration of clustered machines from low-end machines to high- 
end machines and any combination thereof. When demand increases, the demand is easily satis- 
fied by adding more nodes. The newly added nodes can utilize the latest generation of hardware 
and operating systems without requiring the elimination or upgrading of older nodes. In other 
words, clusters tend to scale up seamlessly while riding the technology curve represented in new 
hardware and operating systems. Availability of the overall system is enhanced when cluster 
applications are written so as not to depend on any single resource in the cluster. As resources 
are added to or removed from a cluster, applications are dynamically rescheduled to redistribute
the workload. Even in the case where a significant portion of the cluster is down for service, the
application can continue to run on the remaining portion of the cluster. This continued operation
has significant advantages particularly when employed to implement a cluster-based component
architecture of the type described in the above-identified cross-referenced application entitled
MARKET ENGINES HAVING EXTENDABLE COMPONENT ARCHITECTURE.

Attorney Docket No.: ATAE1015DEL
1015_00 A 07 A 20.fi.wpd
Page 2 of 94
Express Mail Label No.: EL328296286US
7/20/0-22:31

While clustering technology shows promise at overcoming problems of existing systems, 
there exists a need for practical clustering systems. In practical clustering systems, it is undesir- 
able for each application in a cluster system to manage its own resources. First, it is inefficient 
to have each application solve the same resource management problems. Second, scheduling for 
conflict resolution and load-balancing (which is important for scalability) is more effectively 
solved by a common flexible (extensible) resource manager that solves the common problem 
once, instead of solving the problem specifically for each application. Furthermore, failure 
states tend to be complex when each application behaves differently as a result of failures and 
with such differences, it is almost impossible to model the impact of such failures from applica- 
tion to application running on the cluster. To overcome these problems, commercial and aca- 
demic projects have arisen with the objective of providing a clustering architecture that provides 
isolation between physical systems and the applications they execute. 

To date, however, proposed clustering architectures are complex and can only handle a
limited number of specific system failures. In addition, proposed clustering software does not
appropriately scale up across multiple sites. There is a need, therefore, for a simple and elegant
clustering architecture that includes fault-tolerance and load-balancing, that is extendable over
many computer systems and that has a flexible interface for applications. In such an architecture,
the number of failure states needs to be kept low so that extensive testing is possible to render
the system more predictable. Hardware as well as software failures need to be detected
and resources need to be rescheduled automatically, both locally as well as remotely. Rescheduling
needs to occur when a particular application or resource is in high demand. However, rescheduling
should be avoided when unnecessary because rescheduling can degrade application
performance. When possible, rescheduling should only occur in response to resource shortages
or to avoid near-term anticipated shortages. If the system determines that resource requirements
are likely to soon exceed the capacity of a system element, then the software might appropriately
reschedule to avoid a sudden near-term crunch. The result of this "anticipatory" rescheduling is
avoidance of resource bottlenecks and thereby improvement in overall application performance.
The addition and removal of components and resources need to occur seamlessly in the system.
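
By way of illustration only, the "anticipatory" rescheduling decision described above might be sketched in Python as follows. The sketch assumes that load is sampled at regular intervals and that a simple linear extrapolation is an acceptable predictor; the function names, the horizon parameter and the extrapolation rule are hypothetical and are not taken from this specification.

```python
from typing import Sequence


def projected_load(samples: Sequence[float], horizon: int) -> float:
    """Linearly extrapolate the load `horizon` sampling intervals ahead."""
    if not samples:
        return 0.0
    if len(samples) < 2:
        return samples[-1]
    # Average change per interval between the first and last sample.
    slope = (samples[-1] - samples[0]) / (len(samples) - 1)
    return samples[-1] + slope * horizon


def should_reschedule(samples: Sequence[float], capacity: float,
                      horizon: int = 3) -> bool:
    """Reschedule only on an actual or a near-term anticipated shortage."""
    if not samples:
        return False
    if samples[-1] > capacity:
        return True  # shortage already present
    # Otherwise reschedule only if the trend predicts a shortage soon.
    return projected_load(samples, horizon) > capacity
```

With a steady load the function declines to reschedule, which reflects the stated preference for avoiding unnecessary rescheduling.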

In view of the above background, it is an object of the present invention to provide an
improved fault-tolerance framework for an extendable computer architecture. 

SUMMARY

The present invention is a computer system having a fault-tolerance framework in an
extendable computer architecture. The computer system is formed of clusters of nodes where each
node includes computer hardware and operating system software for executing jobs that implement
the services provided by the computer system. Jobs are distributed across the nodes under
control of a hierarchical resource management unit. The resource management unit includes
hierarchical monitors that monitor and control the allocation of resources.

In the resource management unit, a first monitor, at a first level, monitors and allocates
elements below the first level. A second monitor, at a second level, monitors and allocates
elements at the first level. The framework is extendable from the hierarchy of the first and second
levels to higher levels where monitors at higher levels each monitor lower-level elements in a
hierarchical tree. If a failure occurs down the hierarchy, a higher-level monitor restarts an
element at a lower level. If a failure occurs up the hierarchy, a lower-level monitor restarts an
element at a higher level. While it may be adequate to have two levels of monitors to keep the
framework self-sufficient and self-repairing, more levels may be efficient without adding
significant complexity. It is possible to have multiple levels of this hierarchy implemented in a
single process.
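
The two-way restart behavior described above can be sketched, under stated assumptions, as a tree of elements in which every live element checks both its children and its parent. The class name `Element`, the function `monitor_pass` and the in-memory `alive` flag are hypothetical stand-ins; an actual embodiment would detect failure and start replacement processes by other means.

```python
from typing import List, Optional


class Element:
    """A monitored element (job, agent or coordinator) in the hierarchy."""

    def __init__(self, name: str, level: int) -> None:
        self.name = name
        self.level = level
        self.alive = True
        self.restarts = 0
        self.parent: Optional["Element"] = None
        self.children: List["Element"] = []

    def add_child(self, child: "Element") -> "Element":
        child.parent = self
        self.children.append(child)
        return child

    def restart(self) -> None:
        # Stand-in for starting a replacement instance of the element.
        self.alive = True
        self.restarts += 1


def monitor_pass(monitor: Element) -> List[str]:
    """Run one monitoring pass; return the names of restarted elements."""
    restarted: List[str] = []
    if not monitor.alive:
        return restarted  # a failed element cannot monitor anything
    for child in monitor.children:
        if not child.alive:  # failure down the hierarchy
            child.restart()
            restarted.append(child.name)
    parent = monitor.parent
    if parent is not None and not parent.alive:  # failure up the hierarchy
        parent.restart()
        restarted.append(parent.name)
    return restarted
```

In this sketch a coordinator with two agents recovers a failed agent on its own pass, and either agent recovers a failed coordinator on the agent's pass.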

In some embodiments, each of the monitors includes termination code that causes an ele- 
ment to terminate if duplicate elements have been restarted for the same operation. The termina- 
tion code in one embodiment includes suicide code whereby an element will self-destruct when 
the element detects that it is an unnecessary duplicate element. 
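
As a minimal sketch of such termination code, assume that every instance of an element registers itself under the operation it performs and that, among duplicates, the instance with the lowest identifier survives. The registry, the identifiers and the lowest-identifier rule are assumptions made for illustration and are not specified here.

```python
from collections import defaultdict
from typing import Dict, List, Set


class Registry:
    """Tracks which instances are currently live for each operation."""

    def __init__(self) -> None:
        self._instances: Dict[str, Set[int]] = defaultdict(set)

    def register(self, operation: str, instance_id: int) -> None:
        self._instances[operation].add(instance_id)

    def unregister(self, operation: str, instance_id: int) -> None:
        self._instances[operation].discard(instance_id)

    def live(self, operation: str) -> List[int]:
        return sorted(self._instances[operation])


class DuplicateAwareElement:
    """An element carrying suicide code for the duplicate case."""

    def __init__(self, registry: Registry, operation: str,
                 instance_id: int) -> None:
        self.registry = registry
        self.operation = operation
        self.instance_id = instance_id
        self.alive = True
        registry.register(operation, instance_id)

    def check_duplicate(self) -> bool:
        """Self-terminate if another live instance takes precedence.

        Returns True when this instance terminated itself.
        """
        if not self.alive:
            return False
        survivor = min(self.registry.live(self.operation))
        if survivor == self.instance_id:
            return False  # this instance is the one that should keep running
        self.alive = False
        self.registry.unregister(self.operation, self.instance_id)
        return True
```

Because each instance applies the same deterministic rule, exactly one instance remains after every duplicate has run its check.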

In one local level embodiment, the resource management unit includes agents as elements
in the first level where the agents monitor and control the allocation of jobs to nodes and
includes a local coordinator in the second level where the local coordinator monitors and
controls the allocation of jobs to agents. Also, the agents monitor the local coordinator. Failure of a
job results in the monitoring agent for the failed job restarting a job to replace the failed job.
Failure of an agent results in the monitoring agent for the failed agent restarting an agent to
replace the failed agent. Failure of the local coordinator results in restarting of a local
coordinator to replace the failed local coordinator. In a particular example of a local level
embodiment, the agents are implemented as host agents where a host agent only monitors the jobs
running on one node.

In a higher level hierarchy, one or more group coordinators are added at a group level 
above the local level where each group coordinator monitors and controls multiple local coordi- 
nators where each local coordinator monitors and controls lower level agents which in turn mon- 
itor and control lower level jobs. 

In a still higher level hierarchy, one or more universal coordinators are added at a univer- 
sal level above the group level where each universal coordinator monitors and controls multiple 
local coordinators where each local coordinator monitors and controls lower level agents which 
in turn monitor and control lower level jobs. 

The present computer system gives highest priority to maintaining the non-stop operation
of important elements in the processing hierarchy which, in the present specification, are defined
as jobs. While other resources such as the computer hardware, computer operating system software
or communications links are important for any instantiation of a job that
provides services, the failure of any particular computer hardware, operating system software,
communications link or other element in the system is not important since upon such failure, the
job is seamlessly restarted using another instantiation of the failing element. The quality of
service of the computer system is represented by the ability to keep jobs running independently of
what resource fails in the computer system by simply transferring a job that fails, appears to have
failed or appears to be about to fail. Such transfer is made regardless of the cause and
without necessarily diagnosing the cause of failure.

The present computer system utilizes redundancy of simple operations to overcome failures
of elements in the system. The redundancy is facilitated using hierarchical monitors that
decouple fault-tolerance processes for monitoring failure from the services (executed by
application programs that are implemented by jobs).

An indication of progress of a service is determined by using, in applications that provide 
a service, the capability of processing progress messages. The progress messages traverse the 
vital paths of execution of the service before returning a result to the progress monitor. The 
progress monitor is independent of the fault-tolerance layer and does not interfere with fault-tol- 
erant operation. Restart of failing jobs is simple and quick without need to analyze the cause of 
failure or measure progress of the service. 
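
One way to picture the progress messages described above is a probe token that must pass through every stage on the vital path of a service before a reply is returned to the progress monitor. The sketch below is an assumption-laden illustration: the stage names, the `Service` class and the `is_progressing` function are hypothetical, and a stalled stage is modeled simply as a name in a `broken` set.

```python
from typing import FrozenSet, List, Optional, Tuple


class Service:
    """A service whose vital path is an ordered list of named stages."""

    def __init__(self, stages: List[str],
                 broken: FrozenSet[str] = frozenset()) -> None:
        self.stages = list(stages)
        self.broken = frozenset(broken)

    def handle_progress(self, token: str) -> Optional[Tuple[str, List[str]]]:
        """Carry the token through each stage in order.

        Returns (token, visited stages), or None if a stage never
        forwards the token.
        """
        visited: List[str] = []
        for stage in self.stages:
            if stage in self.broken:
                return None  # the token is lost at the stalled stage
            visited.append(stage)
        return token, visited


def is_progressing(service: Service, token: str = "probe-1") -> bool:
    """Progress monitor: the probe must return after the full vital path."""
    reply = service.handle_progress(token)
    if reply is None:
        return False
    returned_token, visited = reply
    return returned_token == token and visited == service.stages
```

The progress check here is kept separate from any restart logic, in keeping with the statement that the progress monitor is independent of the fault-tolerance layer.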

The present computer system inherently provides a way to seamlessly migrate operation 
to new or different hardware and software. Because the present computer system inherently as- 
signs jobs among available resources and automatically transfers jobs when failures occur, the 
same dynamic transfer capability is used seamlessly, maintaining non-stop operation, for system 
upgrade, system maintenance or other operation where new or different hardware and software 
are to be employed. 

The present computer system operates such that if any element is in a state that is un- 
known (such as a partial, possible or imminent failure) then the fault-tolerant operation reacts by 
assuming a complete failure has occurred and thereby immediately forces the system into a 
known state. The computer system does not try to analyze the failure or correct the failure for 
purposes of recovery, but immediately returns to a known good state and recalculates anything 
that may have happened since the last known good state. 
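
The return to a known good state followed by recalculation can be sketched as a checkpoint plus a log of the inputs applied since that checkpoint. This is only one possible reading; the input log, the running-total state and the class name `ReplayableJob` are assumptions introduced for the example.

```python
from typing import List


class ReplayableJob:
    """A job whose state is a running total of the inputs it has applied."""

    def __init__(self) -> None:
        self.state = 0                 # current (possibly suspect) state
        self.checkpoint = 0            # last known good state
        self.log: List[int] = []       # inputs applied since the checkpoint

    def apply(self, value: int) -> None:
        self.log.append(value)
        self.state += value

    def commit(self) -> None:
        """Declare the current state known good and start a fresh log."""
        self.checkpoint = self.state
        self.log.clear()

    def recover(self) -> int:
        """Treat any unknown state as a complete failure.

        The current state is discarded without analysis; the job returns to
        the checkpoint and recomputes everything logged since then.
        """
        self.state = self.checkpoint
        for value in self.log:
            self.state += value
        return self.state
```

Note that `recover` never inspects the suspect state, which mirrors the stated choice not to analyze or correct the failure.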

The present computer system works well in follow-the-sun operations. For example, the
site of actual processing is moved from one location (for example, Europe) to another location
(for example, US) where the primary site is Europe during primary European hours and the
primary site is US during primary US hours. Such follow-the-sun operation tends to achieve better
performance and lower latency. The decision of when to switch over from one site to another can be
controlled by a customer or can be automated.

The present system includes an interface that collects and provides output information
and receives input information and commands that allow humans to monitor and control the 
computer system and each of the components and parts thereof. The interface logs data and 
processes the logged data to form statistics including up-time, down-time, failure, performance, 
configuration, versions, through-put and other component and system information. The interface 
provides data for system availability measurements, transaction tracking and other information 
that may be useful for satisfying obligations in service agreements with customers. 

The present system provides, when desired, customer process isolation. For example, 
first jobs running on first nodes associated with a first customer are isolated from second jobs 
associated with a second customer running on second nodes, where the second nodes are differ- 
ent from the first nodes. 
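
A minimal sketch of such isolation, assuming that nodes are simply divided among customers in round-robin fashion and that each job is then placed only on a node from its own customer's pool, is given below. The function names and the round-robin policy are hypothetical; any policy that yields disjoint node sets would serve.

```python
from itertools import cycle
from typing import Dict, List, Sequence, Tuple


def partition_nodes(customers: Sequence[str],
                    nodes: Sequence[str]) -> Dict[str, List[str]]:
    """Give every customer its own disjoint pool of nodes."""
    if len(nodes) < len(customers):
        raise ValueError("need at least one node per customer")
    pools: Dict[str, List[str]] = {customer: [] for customer in customers}
    for node, customer in zip(nodes, cycle(customers)):
        pools[customer].append(node)
    return pools


def place_jobs(jobs: Sequence[Tuple[str, str]],
               nodes: Sequence[str]) -> Dict[str, str]:
    """Map each (job, customer) pair to a node in that customer's pool."""
    customers = sorted({customer for _, customer in jobs})
    pools = partition_nodes(customers, nodes)
    # Rotate through each customer's own pool when placing its jobs.
    cursors = {customer: cycle(pools[customer]) for customer in customers}
    return {job: next(cursors[customer]) for job, customer in jobs}
```

Since no node appears in two pools, jobs of different customers never share a node in this sketch.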

The foregoing and other objects, features and advantages of the invention will be appar- 
ent from the following detailed description in conjunction with the drawings. 

BRIEF DESCRIPTION OF THE DRAWINGS 

FIG. 1 depicts a computer system consisting of distributed groups of clusters.

FIG. 2 depicts details of the clusters of the type employed in FIG. 1.

FIG. 3 depicts further details of the processes running on the clusters of FIG. 2. 

FIG. 4 depicts a logical view of the local job manager hierarchy running at levels in the 
hierarchy above jobs running on nodes of a platform. 

FIG. 5 depicts a logical view of the multi-level hierarchy of the resource management unit
with interfaces to jobs and nodes on lower level platforms.

FIG. 6 depicts a logical view of the multi-level hierarchy of the resource management unit
with multiple universal coordinators at the universal level, with multiple group coordinators at 
the group level, with multiple local coordinators at the local level and with multiple agents at the 
agent level. 

FIG. 7 depicts an example of the implementation of a group level hierarchy with vertical 
integration of processes of the hierarchy on some nodes. 

FIG. 8 depicts an example of the implementation of a group level hierarchy with vertical
integration of levels of the hierarchy in single processes on some nodes. 

FIG. 9 depicts an example of the implementation of a group level hierarchy with horizon- 
tal integration of processes of the same levels on common nodes. 

FIG. 10 depicts details of fault-detection and correction during a simple job failure. 

FIG. 11 depicts recovery from a vertical failure.

FIG. 12 depicts recovery from a horizontal failure. 

FIG. 13 depicts a conflict situation where multiple monitors replace a single failing ele- 
ment. 

FIG. 14 depicts examples of components relevant for financial services where the components
are implemented as services on a cluster computer system.

FIG. 15 depicts an example of an e-commerce system using the components of FIG. 14. 

FIG. 16 depicts a logical view of the local job manager running with host agents at levels 
in the hierarchy above jobs running on nodes of a platform. 

DETAILED DESCRIPTION 

Cluster Groups - FIG. 1

In FIG. 1, a plurality of clusters 9 are distributed in different groups 5 including groups
5-1, 5-2, 5-3, ..., 5-G and connect through the networks 13 to form an e-commerce system 2.
The groups 5 are organized on geographical, company, type of information processed or other 
logical basis. 

In one example, the groups 5 of clusters 9 in FIG. 1 are distributed geographically around
the world. The group 5-1, for example, has clusters 9, and specifically clusters 9_1, ..., 9_G1,
located in Europe. Group 5-2, by way of example, includes clusters 9, and specifically clusters
9_2, ..., 9_G2, located in Asia. Group 5-3, for example, includes clusters 9, and specifically
clusters 9_3, ..., 9_G3, located in the eastern United States and group 5-G, by way of example,
includes clusters 9, and specifically clusters 9_G, ..., 9_GG, located in the western United States.

In a geographic distribution example, the FIG. 1 worldwide e-commerce system 2 is controlled
in different ways. In one example, each group 5 is in a different region of the world
where each region controls worldwide transactions during the principal business hours of that
region and where the control shifts to another region when the principal business hours shift
thereby implementing a "follow-the-sun" operation. Since the principal business hours change
as a function of time and location around the world, transactions that are principal at one point in
time in one group 5 are shifted to another group 5 in another region at different times of day
relative to common time.

In one embodiment of a follow-the-sun operation, a single site for a group 5 is at one lo- 
cation in the world and that single site serves customers around the world where primary access 
privileges for that site are passed in a follow-the-sun manner to different persons around the 
world. In that one embodiment, the primary access privileges participate in a follow-the-sun 
operation but the actual processing site does not change location in the world. In another em- 
bodiment of a follow-the-sun operation, multiple sites for multiple groups 5 at multiple locations 
in the world are enabled to serve customers around the world where the primary site for actual 
processing is re-designated from location to location so as to follow-the-sun. Moving the site
of actual processing from one location (for example, Europe) to another location (for example,
US), where the primary site is Europe during primary European hours and the primary site is US
during primary US hours, tends to achieve better performance and lower latency. The decision of
when to switch over from one site to another can be controlled by the client or can be automated. 
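
For illustration, the automated designation of the primary site might be sketched as a lookup of the current hour against per-site windows, with a client-supplied override taking precedence. The window boundaries below, the use of UTC as the common time and the names `PRIMARY_WINDOWS` and `primary_site` are hypothetical values chosen for the example.

```python
from datetime import datetime, timezone
from typing import Dict, Optional, Tuple

# Hypothetical primary windows as [start, end) hours in UTC.
PRIMARY_WINDOWS: Dict[str, Tuple[int, int]] = {
    "Europe": (7, 15),
    "US": (15, 23),
    "Asia": (23, 7),  # wraps past midnight
}


def primary_site(now_utc: datetime,
                 windows: Dict[str, Tuple[int, int]] = PRIMARY_WINDOWS,
                 override: Optional[str] = None) -> str:
    """Return the site designated as primary at the given time."""
    if override is not None:
        return override  # switch-over controlled by the client
    hour = now_utc.hour
    for site, (start, end) in windows.items():
        if start <= end:
            in_window = start <= hour < end
        else:
            in_window = hour >= start or hour < end
        if in_window:
            return site
    raise ValueError("no site is designated for hour %d" % hour)
```

A call such as `primary_site(datetime.now(timezone.utc))` would give the automated choice, while passing `override` models a switch-over decided by the client.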

In order to control the operations of the groups 5 of clusters 9, each group 5 includes one 
or more resource management units (RMUs) 8 for controlling the group operation. In one exam- 
ple, a resource management unit 8 is present in each cluster, whereby transactions are routed to 
different clusters in the same or different groups as a function of time or other parameters. Each 
resource management unit (RMU) 8 is associated with other processes including communication 
and function processes for supporting cluster operation and communication. 

In another example, the groups 5 of clusters 9 in FIG. 1 are organized based upon operations
of a single company or a group of companies. For example, group 5-1 includes all of the
clusters 9 for a single company that service one geographic region (for example, Berlin) while
group 5-2 includes all of the clusters 9 for the same company that service another geographic
region (for example, New York City). In such an example, resource management units (RMUs)
8 control the group operations.

In still another example where the groups 5 of clusters 9 in FIG. 1 are organized based
upon operations of a single company, the group 5-1 includes all of the clusters 9 in that company
that service a particular type of information (for example, one type of marketable instruments
such as stocks) while group 5-2 includes all of the clusters 9 in that company that service another
type of information (for example, another type of marketable instruments such as bonds or
derivatives). In such an example, resource management units (RMUs) 8 control the group operations.

The above examples illustrate that any combination of clusters 9 can be used to establish
the common control functions within one or more groups 5 and within the e-commerce system 2.

Multiple Cluster Design - FIG. 2

In FIG. 2, typical ones of the clusters 9 of FIG. 1 are shown including clusters 9-1, 9-2, ...,
9-C1. The cluster 9-1 is typical of clusters 9 and includes one or more nodes 51 shown as
nodes 51-1_1, ..., 51-1_Os that are formed of one or more computers 43-1_1, ..., 43-1_Ha, each
computer having corresponding operating systems (OSs) 42-1 including operating systems 42-1_1, ...,
42-1_Os, respectively. Processes 41-1 are distributed to execute on the nodes formed of operating
systems 42-1 and computers 43-1 of cluster 9-1. The processes 41-1 of cluster 9-1 are organized
as belonging to a service unit 44-1, a communications unit 45-1 and a resource management unit
46-1.

In FIG. 2, the service unit 44-1 includes the services S_1, S_2, ..., S_C1 and these services are
the primary reason that cluster 9-1 exists. By way of example, if the primary purpose of cluster
9-1 is to execute financial transactions in an e-commerce system, like the e-commerce system
described in connection with FIG. 14, then the different services S_1, S_2, ..., S_C1 of the service
unit 44-1 correspond to some or all of the components 71-2, ..., 71-Co of FIG. 14. Each of the
services of service unit 44-1 is partitioned into one or more jobs 30 for execution on a node 51.

In FIG. 2, the communication unit 45-1 controls communications from and to the cluster
9-1 and the other clusters 9-2, ..., 9-C1 of FIG. 2. The communication unit 45-1 controls the
intra-cluster communication with other communication units 45-2, ..., 45-C1 of FIG. 2 over the
connection elements 67 and controls the extra-cluster communication external to the clusters 9 of
FIG. 2 over the connection elements 68.

In FIG. 2, resource management unit 46-1 includes, for example, processes that are units 
for fault tolerance, load balancing and persistent storage operations. 
While cluster 9-1 is typical of the clusters 9 of FIG. 1, each of the other clusters 9-2, ...,
9-C1 includes one or more nodes 51 shown as nodes 51-2_1, ..., 51-2_Os and so on to nodes
51-C1_1, ..., 51-C1_Os that are formed of one or more computers 43 that are shown as 43-2_1, ...,
43-2_Ha, and so on to 43-C1_1, ..., 43-C1_Ha, each computer having corresponding operating systems
(OSs) 42 including operating systems 42-2_1, ..., 42-2_Os and so on to 42-C1_1, ..., 42-C1_Os,
respectively. Processes 41-2 and so on to 41-C1 have jobs that are distributed to execute on the
nodes formed of operating systems 42-2 and computers 43-2 and so on to operating systems 42-C1 and
computers 43-C1 of clusters 9-2, ..., 9-C1, respectively. The processes 41-2 and so on to 41-C1 of
clusters 9-2 and so on to 9-C1 are organized as belonging to service units 44-2 and so on to 44-C1,
communications units 45-2 and so on to 45-C1 and resource management units 46-2 and so on to 46-C1.

The communication processes of the communication unit 45 of FIG. 2 are ones that are
suitable for the particular embodiment selected for the connection elements 67 and 68. The
connection elements 67 and 68 are logical entities that rely on the necessary physical
interconnection of each of the clusters 9 and appropriate protocols for those interconnections.
When the connection element 67 or 68 is implemented as a local area network using TCP/IP, for
example, the processes of communication units 45 provide for IP address assignment and addressing
as a means to control communication among the clusters 9. When the connection element is
implemented using point-to-point switching, for example, the communication processes are those
suitable for providing point-to-point switching protocols for transferring data between clusters 9.
Regardless of the implementation of elements 67 and 68, the processes of communication units
45 provide a logically consistent interface among clusters 9 that permits both homogeneous
clusters (using the same hardware computers and operating systems) as well as heterogeneous
clusters (using different hardware computers and/or operating systems) to transfer data. Nodes, in
addition to being of different hardware and operating systems, may also run heterogeneous
applications. The reasons for heterogeneous applications include, for example, environments
where special hardware that is needed to run an application is only available on certain nodes or
special software that is needed to run an application is only available on certain nodes (for
example, software licenses).

In one particular embodiment, the communication unit 45 uses object serialization to
transmit message (or other) objects from one of the communication units 45 to another one of
the communication units 45. This operation is done by initiating a network connection (for
example, a TCP/IP connection) and then serializing the message object into a data stream, which is
usually buffered. The data stream is then transmitted by the transmitting communication unit 45
over the connection element 67 operating with the TCP/IP protocol to the receiving communica- 
tion unit 45 where it is de-serialized. In an example using the FIG. 14 system, one embodiment 
sends meta-data of a buy/sell order from the TI interface component 71-10 to the storage compo- 
nent 71-13 and subsequently to the crossing component 71-3. The Java Remote Method Invoca- 
tion (RMI) interface by Sun Microsystems can be used to implement such object serialization 
communication methods. 
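
As a minimal sketch of the serialization path described above (initiate a connection, serialize the message object into a data stream, transmit it, de-serialize it at the receiver), the following Python example uses the standard pickle module in place of Java serialization and a local socket pair in place of a TCP/IP connection over connection element 67. The 4-byte length prefix, the function names and the order fields are assumptions made for illustration only and are not taken from the described embodiment.

```python
import pickle
import socket
import struct


def send_object(sock, obj):
    """Serialize obj and write it as one length-prefixed frame."""
    payload = pickle.dumps(obj)
    # Assumed framing: 4-byte big-endian length, then the serialized bytes.
    sock.sendall(struct.pack("!I", len(payload)) + payload)


def _recv_exact(sock, n):
    """Read exactly n bytes, or raise if the peer closes early."""
    chunks = []
    while n > 0:
        chunk = sock.recv(n)
        if not chunk:
            raise ConnectionError("connection closed mid-frame")
        chunks.append(chunk)
        n -= len(chunk)
    return b"".join(chunks)


def recv_object(sock):
    """Read one length-prefixed frame and de-serialize it."""
    (length,) = struct.unpack("!I", _recv_exact(sock, 4))
    return pickle.loads(_recv_exact(sock, length))


def round_trip(obj):
    """Send obj through a connected socket pair and return the received copy."""
    sender, receiver = socket.socketpair()
    try:
        send_object(sender, obj)
        return recv_object(receiver)
    finally:
        sender.close()
        receiver.close()
```

For instance, round_trip({"order_id": 7, "side": "buy", "shares": 100}) returns an equal dictionary on the receiving side. Because pickle can execute code while loading, a scheme like this is only appropriate between trusted endpoints.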

For different message types and embodiments of the connection element 67, other
communication protocols with different flow-control mechanisms, delivery guarantees and
directory services are used. Various schemes over IP provide alternate embodiments. For
example, heart-beat messages use the UDP/IP protocol because reliable delivery is not required.
Communication protocols are not restricted to IP-based schemes; the only requirement is that
both the transmitting cluster and the receiving cluster are capable of handling messages in
a selected protocol. Other messaging systems, such as Remote Procedure Call (RPC) and Active
Messages, are acceptable implementations as well.

In other embodiments, higher-level (fast) messaging systems are used to communicate 
between clusters. Examples include TIBCO or NEON messaging layers which are again able to 
completely abstract the communication layer from the underlying hardware clusters and thus 
effectively act as middle-ware. Other middleware products include Talarian Smart Sockets, Java 
Message Queue and Vitria. 

In further embodiments, multiple clusters run on the same hardware and operating system 
node using the same memory. In such embodiments, the same communication mechanisms are
used as described above. Additionally, specialized inter-process communication schemes can be
used for improved performance and better use of system resources. 

In general, operations that are performed by the FIG. 2 system include jobs that execute 
to provide the services 44, include processes used in connection with the communication units 
45 and the resource management units 46 and include operating system calls for operating sys- 
tems 42, memory controls and availability determinations, network access control and latency 
determinations and any other operations useful in or in connection with the computer system of 
FIG. 2. 

Process Architecture - FIG. 3 

FIG. 3 depicts a logical overview of the architecture of a set of processes 41 typical of
the distributed sets of processes 41-1, 41-2, ..., 41-C1 in the clusters 9 of FIG. 2. A typical set of
processes 41 in FIG. 3 includes the service unit 44 processes that are typical of the distributed
service units 44-1, 44-2, ..., 44-C1 in the clusters 9 of FIG. 2. The service unit 44 processes include
the services 44_1, 44_2, 44_3, ..., 44_S that are applications or functions that as a whole are typically
distributed across multiple nodes of a cluster (that is, for cluster 9-1 of FIG. 2, across one or
more of the computers 43-1 and corresponding operating systems 42-1, respectively) or across
nodes of multiple clusters.

The set of processes 41 in FIG. 3 includes the communication unit 45 processes that are
typical of the distributed communication units 45-1, 45-2, ..., 45-C1 in the clusters 9 of FIG. 2.
The set of processes 41 in FIG. 3 also includes the resource management unit 46 processes that are
typical of the distributed resource management units 46-1, 46-2, ..., 46-C1 in the clusters 9 of
FIG. 2. The resource management unit 46 includes a fault-tolerance unit 46_1 for ensuring
fault-tolerant operation of the processes 41 and the clusters 9 on which they execute. The
fault-tolerance unit 46_1 includes a job manager 48 for scheduling resources among the services 44_1, 44_2,
44_3, ..., 44_S. The resources scheduled include, for example, CPU time, disk and memory
privileges and network bandwidth. While such resource management is a function that in
conventional systems is usually performed by the operating system 42 on each node of a cluster 9 of
FIG. 2, the distributed resource management unit 46 is provided to add fault tolerance,
load balancing, persistent storage and output capabilities to each cluster 9 and to the global e-commerce
system 2.

For fault tolerance operation, if a hardware or software component fails on a node, the
distributed resource management unit 46 through operation of the fault-tolerance unit
automatically restarts the component on the same or a different node. If possible, restarting on the
same node is desirable since in this way the failure is fixed at a lower level without having to
make a call to a higher level. If it is not possible to restart on the same node, the operation restarts
the interrupted component on a different node. If a cluster failure occurs, or if the other non-failing
nodes on a cluster are not suitable for restarting the component, all services are then restarted to
run on a different cluster. If a group of clusters fails, all services are scheduled to run on a
different group of clusters. A group of clusters has redundancy and ordinarily is not expected to fail.
However, group failure may occur in some disasters (such as an earthquake or other
environmental calamity) but such occurrence is expected to be rare. In other situations, it may also be
desirable to move services to another group of clusters without interrupting service. For
example, planned maintenance, upgrades, load balancing and reconfiguration all may involve moving
services among clusters and groups of clusters.
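
The escalation order in the paragraph above (same node, then a different node of the same cluster, then a different cluster, then a different group of clusters) can be sketched as a simple search. This is only an illustration: the nested topology dictionary, the healthy set and the function name are hypothetical, and the sketch picks a target for one component rather than moving all services together as the text describes for cluster and group failures.

```python
def choose_restart_target(failed_node, topology, healthy):
    """Return (level, node) for restarting a component that ran on failed_node.

    topology maps group -> cluster -> list of nodes (hypothetical layout).
    healthy is the set of nodes currently able to run the component.
    Returns None when no suitable node exists anywhere.
    """
    home = None
    for group, clusters in topology.items():
        for cluster, nodes in clusters.items():
            if failed_node in nodes:
                home = (group, cluster)
    if home is None:
        raise ValueError(f"unknown node: {failed_node}")
    home_group, home_cluster = home

    # Lowest level first: the node itself may still be usable.
    if failed_node in healthy:
        return ("same node", failed_node)
    # Next: another node in the same cluster.
    for node in topology[home_group][home_cluster]:
        if node in healthy:
            return ("same cluster", node)
    # Next: a node in another cluster of the same group.
    for cluster, nodes in topology[home_group].items():
        if cluster != home_cluster:
            for node in nodes:
                if node in healthy:
                    return ("different cluster", node)
    # Last resort: a node in a different group of clusters.
    for group, clusters in topology.items():
        if group != home_group:
            for nodes in clusters.values():
                for node in nodes:
                    if node in healthy:
                        return ("different group", node)
    return None
```

Trying the lowest level first mirrors the stated preference for fixing a failure without making a call to a higher level.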

For load balancing operation, the distributed resource management unit 46 through
operation of the load-balancing unit 46_2 detects when a particular resource in a cluster or group of
clusters is being taxed or is likely to be taxed more than other comparable resources and takes
appropriate action to reschedule some of the jobs to a less taxed resource, thereby achieving
load balancing.
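
One way to picture the rescheduling step is a greedy loop that moves a job from the most taxed node to the least taxed node as long as the move narrows the gap between them. The sketch below is an assumption-laden illustration, not the method of the load-balancing unit: the load units, the dictionaries and the function name rebalance are all hypothetical, and job costs are assumed to be positive.

```python
def rebalance(assignment, cost, nodes):
    """Move jobs from the most taxed node to the least taxed node.

    assignment maps job -> node and is updated in place.
    cost maps job -> positive load units (hypothetical measure).
    nodes lists every node that may receive jobs.
    Returns the list of (job, source, destination) moves.
    """
    moves = []
    while True:
        load = {n: 0 for n in nodes}
        for job, node in assignment.items():
            load[node] += cost[job]
        busiest = max(nodes, key=lambda n: load[n])
        idlest = min(nodes, key=lambda n: load[n])
        gap = load[busiest] - load[idlest]
        # A move only helps if the job is smaller than the gap.
        movable = [j for j, n in assignment.items()
                   if n == busiest and cost[j] < gap]
        if not movable:
            return moves
        job = max(movable, key=lambda j: cost[j])
        assignment[job] = idlest
        moves.append((job, busiest, idlest))
```

Each move strictly lowers the sum of squared node loads, so the loop terminates once no single job can be moved with benefit.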

The distributed resource management unit 46 uses a persistent storage unit 46_3 in order to
allow applications such as the services 44_1, 44_2, 44_3, ..., 44_S to store state information about
executing processes to non-volatile memory of persistent storage unit 46_3 in a consistent way. Such
state information typically includes computational results and data to checkpoint the executing
application at restartable execution points. Checkpoints are selected to store operating
parameters and progress of an application after major computational steps or at certain points in the
execution sequence. If a failure occurs, applications that operate with such checkpoints are restarted
by the fault-tolerance unit 46_1 and/or the load-balancing unit 46_2 at the last successfully
completed checkpoint. Because the persistent storage facility 46_3 is part of the resource management
unit 46, state information can be transparently replicated to remote sites, allowing immediate
fail-over even in the case of a site failure.

The interface unit 46_4 is part of the resource management unit 46. The interface unit 46_4
collects and provides output information and receives input information and commands that
allow humans to monitor and control the computer system 2 (see FIG. 1) and each of the
components and parts thereof. The interface unit 46_4 logs data and processes the logged data to form
statistics about the overall system and about each component in the system including up-time,
down-time, failure, performance, configuration, versions, through-put and other component and
system information. The interface unit 46_4 provides data for system availability measurements,
transaction tracking and other information that may be desirable or required. Such output data
is useful for, among other things, satisfying obligations in service agreements with customers
that require contracted levels of system availability and transaction tracking for satisfying legal
or other obligations. The interface unit 46_4 has an internal unit 46_4-1 that provides full data and
control to system administrators and others having authority to access the system for such full
access. The interface unit 46_4 also has an external unit 46_4-2 that provides one or more levels of
access to customers or others not having authority for full system access. Typically, the external
unit is used by or for customers to monitor the overall availability of a service being delivered to
the customers.

There is a tradeoff between the interval of checkpointing and the amount of
recomputation needed upon failure. In some embodiments (based upon the current state of storage
technology), a greater amount of recomputing is preferable over more frequent checkpointing. Each
application that uses the framework may decide what is most effective for given hardware and
software constraints and the application requirements. The decision of how often to checkpoint
is to some degree application-specific. More frequent checkpoints slow down application
performance and less frequent checkpoints require more computation to recover from failure. The
best checkpoint frequency for each application is determined and used for operation. Another
factor that affects checkpoint frequency is the publication of results. A checkpoint is also
required each time results are published outside of the cluster (for example, to a customer). The
checkpoint is required because recomputation does not necessarily produce identical results.
Therefore, once results are published, recomputation is no longer an acceptable recovery
strategy.
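
The checkpoint interval trade-off and the rule that publication forces a checkpoint can be illustrated with a toy stateful job. In this sketch a plain dictionary stands in for the persistent storage unit, and the class name, the interval parameter and the simulated failure are hypothetical; the job simply sums its inputs one step at a time.

```python
class CheckpointedJob:
    """Toy stateful job: sums a list of inputs, one step per input."""

    def __init__(self, inputs, store, interval):
        self.inputs = inputs
        self.store = store        # dict standing in for persistent storage
        self.interval = interval  # checkpoint every `interval` steps
        self.step = 0
        self.total = 0
        self.published = []

    def checkpoint(self):
        """Save progress so that a restart can resume from this point."""
        self.store["state"] = (self.step, self.total, list(self.published))

    def restore(self):
        """Resume from the last completed checkpoint, or from the start."""
        self.step, self.total, published = self.store.get("state", (0, 0, []))
        self.published = list(published)

    def publish(self):
        """Results leave the cluster here, so checkpoint immediately."""
        self.published.append(self.total)
        self.checkpoint()

    def run(self, fail_at=None):
        while self.step < len(self.inputs):
            if fail_at is not None and self.step == fail_at:
                raise RuntimeError("simulated failure")
            self.total += self.inputs[self.step]
            self.step += 1
            if self.step % self.interval == 0:
                self.checkpoint()
        self.publish()
        return self.total
```

With interval=2 and a failure before the fourth input, a replacement job that calls restore() resumes at step 2 and recomputes only one step. A larger interval would write fewer checkpoints but recompute more after a failure.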

Persistent storage can be distributed in many ways. For example, some embodiments
distribute storage over an entire cluster using RAID technology and other embodiments dedicate
persistent storage to separate machines.

The fault-tolerance framework described operates to keep processes running
continuously by providing a hierarchy of monitors that are capable of restarting any failing process or
migrating processes to different nodes on the network when a hardware or software failure is
discovered. The hierarchy also makes sure that the individual monitors are running correctly.

For applications that use processes that do not require state information (stateless
processes), the fault-tolerance framework works well, is fast and does not require persistent storage
because it is not important where the application is running or what data was being processed
before a failure. An example of an application that uses stateless processes is a web server that
serves static HTML pages to clients regardless of which pages the client requested previously or
of which pages other clients have requested in the meantime. In this example, the
fault-tolerance framework need only operate to make sure that an adequate number of web servers are
running to ensure continuous availability of the service. In this example, the same client can be
served by one server for one request and by another server for another request without any
apparent change in the service as viewable by the client.

For applications that use processes that do require state information (stateful processes),
the fault-tolerance framework works to preserve sufficient state information to enable restarting
and transfer of processes. An example of an application using stateful processes is a financial
instrument crossing application in which, for example, stock shares of a buy order and a sell
order are matched and crossed (that is, are bought and sold). In such a trading application, a trader
submits an order to trade shares to an electronic market and, regardless of failure, the order must
not be lost and must remain active in the system until it is executed, expires or is cancelled.
Restrictions on the orders and crossing need to be considered and properly processed even in the
case of failures in the system during the processing. Also, normal trading rules need to be
followed. For example, the rule must be followed that each share can only be executed once
against orders of the same kind on each side (buy with sell; sell with buy).

In order to prevent failures from causing a stateful process to be lost altogether or
improperly executed, a messaging layer within the communication unit 45 routes and reroutes
processing to avoid the consequences of failures as they occur. When the fault-tolerance
framework transfers or restarts processes on different nodes, other processes need to be able to reach
the rerouted or transferred processes after they have been migrated to new nodes. For example,
if orders of a certain type are initially matched on one node but are subsequently migrated for
matching to a new node, a cancellation message for one of these orders needs to be routed to the
new node automatically. Similarly, a new order for matching must be directed to the new node
after migration. In operation, the messaging layer processes messages with logical destinations
that use a logical-to-physical translation that makes any physical transfer transparent to the
affected processes.
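
The logical-to-physical translation described above can be sketched with a small directory that the messaging layer consults on every send. The class names, the logical destination string and the node names below are hypothetical, and delivery is simulated by appending to an in-memory list per node.

```python
class Directory:
    """Maps logical destinations to the node that currently hosts them."""

    def __init__(self):
        self._location = {}

    def register(self, logical_name, node):
        # Called at start-up and again whenever a process is migrated.
        self._location[logical_name] = node

    def resolve(self, logical_name):
        return self._location[logical_name]


class MessagingLayer:
    """Routes messages by logical name; senders never see node addresses."""

    def __init__(self, directory):
        self.directory = directory
        self.delivered = {}  # node -> list of (logical_name, message)

    def send(self, logical_name, message):
        node = self.directory.resolve(logical_name)
        self.delivered.setdefault(node, []).append((logical_name, message))
        return node
```

If a matcher registered as "matcher/type-A" moves from one node to another, re-registering the name is enough for a later cancellation message addressed to "matcher/type-A" to reach the new node, with no change on the sender's side.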

When possible, the fault-tolerance framework only restarts processes once it ensures that
the processes have actually failed. At times, however, there is a trade-off between how quickly a
process can be restarted and how accurately it has been determined that the process has actually
failed. In some cases, which are intended to occur rarely, a process is started or restarted
that did not fail so that one or more unintended instances of a process are executing at the same
time as the intended instance. In a stateless system, restarting of a non-failed process or the
otherwise starting of unintended duplicate processes is only a minor problem because the result is
only that one additional process is activated in a non-conflicting way to handle requests.
However, in a stateful system, a process that is started as a replacement for a process that did not fail
needs to be handled correctly to ensure that the unintended duplicate processes do not cause
data or process corruption.

In order to control the operation of processes in an environment where unintended
duplicate processes may occur, a persistent storage facility is used to store state data that is needed by
the system to continue processing in an environment where unintended duplicate processes have
occurred or may occur because of system failures or because of other reasons. The stored state data is
used, for example, with checkpoints in executing applications and processes to ensure
coordination between execution states of the executing processes and the stored state data in the
persistent store. The coordination between the executing processes and their known states, the stored
states in the persistent store and the control algorithms for controlling reliable processing in spite
of failures and duplicates achieves highly reliable operation and availability.

In order to ensure that indispensable processes can communicate, a messaging layer is 
used that interfaces with a directory service that is integrated with the fault-tolerance framework. 
The directory service operates to conveniently locate information in the framework thereby en- 
suring that a seamless operation results even in the presence of failures. 

The architecture of the processes 41 of FIG. 3 can advantageously utilize embodiments of
the connection element 67 of FIG. 2 that interconnects the different nodes and services in
one or more of the clusters 9. Typically, element 67 uses one interconnect for communication
local to a node and a different interconnect for inter-node communications. In addition, inter-cluster
communications and wide-area communications also likely use different communication
mechanisms. The selection of components for the connection elements 67 is done consistently with the
architecture of the processes 41 of FIG. 3.

Local Job Manager - FIG. 4 

FIG. 4 depicts a logical view of the hierarchy of a local job manager 48_1, which is one
embodiment of the job manager 48 of FIG. 3, together with the local platform 40 including the
jobs 30 and nodes 51 on which the jobs execute. The nodes 51, including nodes 51-1, ..., 51-N,
in local platform 40 are any set of all or some of the nodes 51 for the clusters 9 of FIG. 2. These
nodes 51 in FIG. 4 are implemented using suitable computational devices, such as workstations
or mainframes, with single-processor or multi-processor configurations. The nodes 51 are the
resources that are assigned for executing the jobs 30 that perform the services 44 of FIG. 3.

In FIG. 4, the jobs 30, including jobs 30-1, ..., 30-J, are, for example, programs, threads,
executable code or data structure tasks that are useful in providing data processing services 44.
For fault-tolerant operation, the jobs 30 are monitored for proper operation, execution and
termination. Each job 30 runs on one node 51 and multiple jobs 30 can run on the same node 51 so
that there can be a many-to-one mapping of jobs to nodes. In FIG. 4, for example, Job 3 and Job
4 both run on Node 3 in a two-to-one mapping.

In FIG. 4, the agents 31, including agents 31-1, ..., 31-A, are monitors that monitor the
execution of the jobs 30. One agent 31 can monitor multiple jobs 30 running on multiple nodes
51 or multiple CPUs if a node 51 is implemented with multiple CPUs. Multiple agents 31 can
monitor different sets of jobs 30 on the same node 51. However, each job 30 is only monitored
by one agent 31. In one embodiment, each node 51 is associated with only one agent 31 and in
such an embodiment the monitoring agent is called the host agent. In such an embodiment, the
host agent 31 monitors all jobs 30 running on that node 51.

Each agent 31 includes fault-tolerant code (a) 32 that implements the fault-tolerant
operation of the agent 31. The fault-tolerant code 32 is implemented in various embodiments to
monitor proper operation. In one example, the fault-tolerant code 32 makes checks using
standard operating system calls to see if the monitored job 30 is still running or if the job terminated
successfully or unsuccessfully. Such checks (coupled with time-out values) also detect if the
hardware resources as a whole are still available to run the job. However, these checks alone
may not detect deadlocks, infinite loops or other situations in which the code execution of a job
is not making sufficient progress towards delivering the desired service. Often, a continuous and
explicit indication of progress is needed to detect such failures. Because indications of progress
tend to be application specific, the fault-tolerant code 32 in one embodiment only watches for
heart-beat messages or other indicators. Each application has code 49 in a service 44 that
contains the required logic to respond appropriately depending on progress. If a job terminates
unexpectedly or a resource becomes unavailable, the agent 31 watching the job is responsible for
restarting that job either on the same one of the nodes 51 on which it was running before or on an
alternate one of the nodes 51.

The code 32 for the agents 31-1, ..., 31-A includes a suicide protocol that operates only
on the logical level of agents 31. Each hierarchy level in the fault-tolerance unit 46_1 uses this
suicide protocol and each job 30 is monitored by exactly one agent 31. The FIG. 4
embodiment only has a local level corresponding to local coordinator 33. Additional levels are possible
as described in connection with FIG. 5.

In FIG. 4, the local coordinator 33 monitors the executions by the agents 31-1, ..., 31-A
and executes the suicide protocol in the fault-tolerant code 34. The local coordinator 33
monitors one or more agents 31. Should an agent 31 fail, the local coordinator 33 in charge of that
agent 31 is responsible for restarting the failing agent 31. In turn, each particular agent 31
watches its corresponding local coordinator 33. In a case where the watched local coordinator
33 fails, the particular agent 31 that detects the local coordinator
33 failure restarts that local coordinator 33 or some alternate local coordinator such as local
coordinator 33'.
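
The heart-beat watching and restart behavior of an agent can be sketched as follows. This is a simplified illustration under stated assumptions: time is passed in explicitly as a number, a job counts as failed when its last heart-beat is older than a timeout, and the class name Agent, the node_is_up callback and the preference for the same node are hypothetical choices, not the fault-tolerant code 32 itself.

```python
class Agent:
    """Watches heart-beats from jobs and restarts jobs that fall silent."""

    def __init__(self, timeout, nodes):
        self.timeout = timeout
        self.nodes = list(nodes)  # candidate nodes, in preference order
        self.last_beat = {}       # job -> time of last heart-beat
        self.placement = {}       # job -> node the job runs on

    def start(self, job, node, now):
        """Record that job was (re)started on node at time now."""
        self.placement[job] = node
        self.last_beat[job] = now

    def heartbeat(self, job, now):
        """Record an explicit indication of progress from job."""
        self.last_beat[job] = now

    def check(self, now, node_is_up):
        """Restart every job whose heart-beat is overdue; return restarts."""
        restarted = []
        for job, seen in list(self.last_beat.items()):
            if now - seen <= self.timeout:
                continue
            old = self.placement[job]
            # Prefer the same node; fall back to the first healthy alternate.
            candidates = [old] + [n for n in self.nodes if n != old]
            target = next((n for n in candidates if node_is_up(n)), None)
            if target is not None:
                self.start(job, target, now)
                restarted.append((job, old, target))
        return restarted
```

A short timeout detects failures sooner but raises the chance of restarting a job that was merely slow, which is the trade-off between restart speed and detection accuracy noted earlier.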

The number of agents 31 used to monitor jobs 30 relative to the number of jobs 30 exe- 
cuting depends on many factors. In one embodiment, one agent 31 is present for each node 51. 
This allocation is desirable because, with such a configuration, a local job failure can be detected
and corrected more quickly and more cheaply (in terms of resources) because no network or external I/O
operation is needed. Similar benefits are derived from having an agent 31 only monitor a few nodes
51. An important benefit results from having one local coordinator 33 monitor many agents 31 
on different nodes because the agents 31 are collectively responsible for keeping their cor- 
responding local coordinator 33 alive. The likelihood of proper detection and correction of such 
a local coordinator 33 fault increases, because it is more likely that at least one of many agents 
31 will be healthy to notice the failure. It is often useful to have one local coordinator 33 per 
major application. If the resources are to be shared among multiple parties, each hierarchy level
can allocate resources to specified parties. Allocation on a per-party basis provides full
fault-tolerance and load-balancing benefits to each party and, for a single heterogeneous
cluster, guarantees that each node is used by only one allocated party at a time,
thereby effectively constructing a dynamic wall between parties. This configuration is useful for
providing allocated services via an application service provider (ASP) running in a cluster
environment shared by multiple parties while providing each party with a separate service level
guarantee in terms of the amount of dedicated resources that are allocated.
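
The per-party guarantee (each node used by only one party at a time, with nodes able to change hands over time) can be illustrated with a small allocator. The class and method names are hypothetical, and the sketch ignores how running jobs would be drained before a node is released.

```python
class NodeAllocator:
    """Grants whole nodes to parties so that no node is shared at any time."""

    def __init__(self, nodes):
        self.owner = {n: None for n in nodes}  # node -> party or None

    def allocate(self, party, count):
        """Give `count` currently free nodes to `party`."""
        free = [n for n, o in self.owner.items() if o is None]
        if len(free) < count:
            raise RuntimeError("not enough free nodes")
        granted = free[:count]
        for n in granted:
            self.owner[n] = party
        return granted

    def release(self, party, node):
        """Return a node to the free pool; only its owner may do so."""
        if self.owner.get(node) != party:
            raise PermissionError(f"{node} is not allocated to {party}")
        self.owner[node] = None

    def nodes_of(self, party):
        return [n for n, o in self.owner.items() if o == party]
```

Because a node has exactly one owner entry, the wall between parties holds at every instant, yet it is dynamic: a released node can be granted to a different party on the next allocation.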

Hierarchical Job Manager - FIG. 5

FIG. 5 illustrates job manager 48 in FIG. 3 in a multi-level hierarchical embodiment. In
FIG. 5, the different hierarchical levels (namely, local, group and universal) connect from local
coordinator 33 at the local level to group coordinator 35 at a group level to a universal
coordinator 37 at a global level. All levels in this hierarchy have a suicide protocol implemented in the
code 34, 36 and 38 of the local coordinator 33, group coordinator 35 and universal coordinator
37, respectively.

The group facility 52-1_1 in FIG. 5 includes local job managers 48-1, ..., 48-L and
platforms 40-1, ..., 40-L. The local job managers 48 include the local coordinators
33 that are the same as the local coordinator 33 in FIG. 4. The local job managers 48
include the agents 31 that are the same as the agents 31 in FIG. 4.

Each local job manager 48-1, ..., 48-L includes an instantiation of a two-level
hierarchy of monitors where agents 31 are one or more first monitors and local coordinator 33 is one
of one or more second monitors. The one or more first monitors (agents 31) are for monitoring
first operations (for example, jobs 30) and, for any particular one of the first operations that fails,
the one or more first monitors (agents 31) operate for restarting another instance of the
particular one of the first operations. The one or more second monitors (local coordinator 33) are for
monitoring the first monitors (agents 31) and, if any particular one of the first monitors fails (for
example, agent 31-A), the one or more second monitors (local coordinator 33-1) operate
for restarting another instance (another agent 31, for example, agent 31-1) of the particular
one of the first monitors.

The platforms 40 include the jobs 30 that are the same as the jobs 30 in FIG. 4. The
platforms 40 include the nodes 51 that are the same as the nodes 51 in FIG.
4. In the embodiment described, the group facilities 52 of FIG. 5 have the job manager and
platform architecture of FIG. 4. In an alternate embodiment, other architectures for the group
facility 52 are possible on the local level while retaining the overall hierarchical structure for
group and/or universal levels. This alternate embodiment is useful, for example, for integrating
existing legacy systems into the multi-level hierarchy of FIG. 5.

In FIG. 5, the group coordinators 35 are responsible for monitoring the local coordinators
33. Accordingly, in one embodiment, the local coordinators 33 are monitored by the group
coordinators 35 as well as by the agents 31 (as described in connection with FIG. 4). In alternate
embodiments, monitoring of the local coordinators 33 is by one or the other of the group
coordinators 35 or agents 31.

Each group facility 52 and group coordinator 35 includes an instantiation of a three-level
hierarchy of monitors where agents 31 include one or more first monitors, local coordinators 33
include one of one or more second monitors and group coordinator 35 includes one of one or
more third monitors. The third monitors (group coordinator 35) operate for monitoring the one
or more second monitors (local coordinators 33) and, for any particular one of the second
monitors that fails, the third monitors operate for restarting another instance of the particular one of
the second monitors. The particular one of the third monitors (group coordinator 35-1) that
monitors the particular one of the second monitors (local coordinator 33-1) that fails runs on the
same node as (for example, node 51-1), or on a different node from (for example, node 51-N),
the node (node 51-1) where the particular one of the second monitors that fails runs.

Clusters 9 have platforms 40 that are grouped for monitoring in different ways. A group 
of clusters can consist of multiple local clusters at one location (for example, in the same build- 
ing complex) or can be widely distributed at locations around the world. The content and orga- 
nization of groups is described in connection with FIG. 1. Further to the discussion in connec- 
tion with FIG. 1, a group can, for example, consist of a single application that runs on different 
clusters. A group also can run a set of applications made available to a single customer. It is 
then possible to provide services to different customers at widely distributed data centers rather 
than at one centralized location. 

The universal coordinators 37 monitor the group coordinators 35 and they work together
in the same way as the group coordinators 35 and the local coordinators 33 in that they each
operate with a suicide protocol in code 38 and can detect and recover from failures at the immediately
lower level of the hierarchy. The universal coordinators 37 also are monitored by the lower level,
in this case the corresponding group coordinators 35. The universal coordinators 37 are useful
for monitoring an entire e-commerce system and are at the root of the hierarchical system and
hence provide a good starting point for human supervision. Again, it is possible to have multiple
universal coordinators, for example, one for applications that are not mission critical (such as
e-entertainment) and others for mission critical applications (such as e-commerce and financial
markets). A failure of a universal coordinator does not mean a failure of the entire e-commerce
system within the hierarchy of the universal coordinator but merely the failure of a monitor in
that hierarchy. At each level, the location and number of the groups can be chosen to
help avoid potential bandwidth restrictions and network delays.

In FIG. 5, the two-level relationship between the agents 31 and local coordinators 33 is 
the relationship of first and second monitors. Similarly, the two-level relationship between the 
local coordinators 33 and group coordinators 35 is the relationship of first and second monitors
and the two-level relationship between the group coordinators 35 and the universal coordinators 
37 is the relationship of first and second monitors. 

Hierarchical Job Manager - FIG. 6

FIG. 6 illustrates another representation of job manager 48 in FIG. 3 in a multi-level
hierarchical embodiment like that of FIG. 5. In FIG. 6, each of the different hierarchical levels is
shown aligned horizontally across the page including the universal level of universal
coordinators 37, the group level of group coordinators 35 and the local level of local coordinators 33.

In FIG. 6, the universal coordinators 37-1, 37-2, ..., 37-U include the code 38-1, 38-2, ...,
38-U, respectively, each operating with a suicide protocol. Universal coordinator 37-1, by way
of example, is the root of the group coordinators 35, which include the code 36,
each operating with a suicide protocol. Group coordinator 35-1_1,
by way of example, is the intermediary root of the local coordinators 33 in the group facility
52-1_1, which in turn include the instances of code 34, each instance operating with a suicide
protocol. Each local coordinator 33 is the intermediary root of corresponding agents, which in turn
each include instances of code 32 as indicated in FIG. 4, each instance operating with a suicide
protocol.

The group facility 52-1_1 in FIG. 6 is like the group facility 52-1_1 in FIG. 5.

Cluster Groups, Example I - FIG. 7

FIG. 7 depicts an example of a snapshot in time of an implementation of the hierarchy
described in FIG. 5 and FIG. 6. The nodes 51(N) represent a subset of nodes, like the nodes 51
described in connection with FIG. 2 through FIG. 6, that are the hardware resources situated in
two or more groups 5 where the groups 5 are of the type described in FIG. 1. In FIG. 7, the
nodes 51(N) are in two groups named GROUP MEMBER G1, including node U and node V,
and GROUP MEMBER GN, including nodes W, X, Y and Z. Each vertical line originating at
one of the nodes 51(N) in FIG. 7 represents a module of computer code executing on that node.
For example, node U has three jobs (J) and one agent (A) executing as four different modules
while node V has two jobs (J), one agent (A), one local coordinator (L) and one group
coordinator (G) executing as five different modules. The universal coordinator U is executing on an
additional node not shown in FIG. 7. In the embodiment of FIG. 7, the code for the G, L and A
levels of group member G1 are logical in nature in that physically they execute on the same node
(NODE V) as other processes J. The group member G1 processes with multiple levels G, L, A
and J have all executing code sharing the same physical resources of a common node (NODE
V).

FIG. 7 shows an example with an agent 31(A) executing on node U that monitors three
jobs (J) also executing on node U. FIG. 7 also shows another agent 31(A) executing on node Y
and monitoring jobs (J) on multiple nodes, specifically one job (J) on node Y and two jobs (J) on
node Z. FIG. 7 further shows a local coordinator 33(L) executing on node V while monitoring
agents 31(A) with one agent (A) executing on node V and one agent (A) executing on node U.
Because executing code often shares the same nodes, it is possible that the failure of a single
machine (for example NODE V) will bring down an entire sub-tree of the hierarchy of FIG. 7. In
such a situation, the recovery may require multiple steps or, in this case, a single step will
recover from multiple failures. However, it is possible to entirely eliminate such situations by
assigning certain hierarchy levels to a disjoint set of nodes as described in connection with FIG. 9.
The advantage of the implementation in FIG. 7 is that there are no restrictions on where any
code can execute and each level of the hierarchy is very close to the next lower level so that no
major communication overhead is required.

Attorney Docket No.: ATAE1015DEL Express Mail Label No.:EL328296286US 




Cluster Groups, Example II - FIG. 8 

FIG. 8 depicts an example of a snapshot in time of an implementation of the hierarchy
described in FIG. 5. The nodes 51(N) represent a subset of nodes, like the nodes 51 described in
connection with FIG. 2 through FIG. 6, that are the hardware resources situated in one or more
groups 5 of the type described in FIG. 1. Each vertical line originating at one of the nodes 51(N)
in FIG. 8 represents code in a code module executing on that node. For example, node X has
two jobs (J) executing in two code modules and one job (J_X), one agent (A_X) and one local
coordinator (L_X) all executing as part of a single code module. Node Y has a universal coordinator
(U_Y), a group coordinator (G_Y) and an agent (A_Y) all executing as code in a single module and
has one job (J) executing as a separate process in a separate code module. Node Z has two jobs
(J) executing in two different code modules. In the embodiment of FIG. 8 for the Node Y, the
different levels of U_Y, G_Y, L_Y and A_Y are logical in nature in that physically they execute on the
same node (NODE Y) along with another process (J) and hence all share the same physical
resources. Logically, the different levels of elements of U_Y, G_Y, L_Y and A_Y are vertically
hierarchical in that U_Y monitors G_Y, G_Y monitors L_Y, L_Y monitors A_Y and A_Y monitors J. The physical
nodes and the code modules that are selected for the elements of U_Y, G_Y, L_Y and A_Y are
determined as part of the system design where factors considered in making the selection include
node availability, fault tolerance and load balancing.

In the FIG. 8 example, an agent A_X executing on node X monitors three jobs (J, J, J_X)
executing on the same node (NODE X). In the FIG. 8 example, another agent A_Y executing on
node Y monitors jobs (J) on multiple nodes, specifically one job (J) on node Y and two jobs (J)
on node Z. In the FIG. 8 example, local coordinator L_X executing on node X monitors agent A_X
and local coordinator L_Y executing on node Y monitors agent A_Y. As is evident from the FIG. 8
example, executing code often shares the same nodes and the same code modules so that it is
possible that the failure of a single machine (for example NODE Y) or a single code module will
bring down a substantial portion of the hierarchy of FIG. 8. In such a failure situation, the
recovery may require multiple steps. However, it is possible to entirely eliminate such situations by
assigning certain hierarchy levels to a disjoint set of nodes as described in connection with FIG.
9. The implementation in FIG. 8 has the advantage that there are no restrictions on where any
code can execute and each level of the hierarchy is close to the next level so that no major
communication overhead is required.

Cluster Groups, Example III - FIG. 9 
FIG. 9 depicts another example of a snapshot in time of an implementation of the hierarchy
described in FIG. 5 with a different allocation of monitor elements than in FIG. 7 and FIG.
8. In FIG. 9, the nodes 51(N) are in two groups named GROUP MEMBER G1, including node
U and node V like in FIG. 7, and GROUP MEMBER GN, including one or more nodes L and A
and including multiple nodes (J). The GROUP MEMBER GN in FIG. 9 differs from GROUP
MEMBER GN in FIG. 7 in that in FIG. 9, the monitors at different levels are grouped at the
same node, that is, the local coordinators 33(L) are both located on one or more L nodes, the
agents 31(A) are located on one or more A nodes and the jobs 30(J) are located on one or more J
nodes (J NODE 1, ..., J NODE J).

In FIG. 9, a specific set of nodes 51(L) for GROUP MEMBER GN is dedicated to run
local coordinators (L) only. If one of the local coordinators L fails, agents (A) and the group
coordinator (G) are only allowed to start a new local coordinator (L) on these dedicated L nodes.
Typically, three nodes are sufficient to provide n+1 failure capabilities such that if one node is
down for service and one node fails, the third node can still perform the job. Any number of
nodes is possible. The principle of dedicated nodes for a level in the hierarchy can apply to all
hierarchy levels where the use of L nodes for the L level of the hierarchy is extendable to G
nodes for the G level, to A nodes for the A level and so on such that each level includes one or
more dedicated nodes for that level. The universal coordinator U is executing on an additional
node not shown in FIG. 9. In the FIG. 9 embodiment, for example, the nodes 51(A) for the agent
level A are dedicated to running agents (A). In another embodiment, a combination of the
dedicated and non-dedicated examples of FIG. 7 and FIG. 9 is employed in the same hierarchy. For
example, the dedicated allocation in FIG. 9 can be applied only to the local coordinators L while
an agent (A) appears on each node so that agents are not dedicated to any particular node. Such
an embodiment helps prevent small errors from propagating to the group level while still
allowing tree structures in part of the hierarchy.
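
The dedicated-node rule described above can be illustrated with a minimal Java sketch. The class DedicatedPlacement, its Level enum and the node names used with it are hypothetical and are not taken from the listings of this application; the sketch only assumes that each hierarchy level has a fixed list of dedicated nodes and that a restart for that level may only be placed on one of them.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.EnumMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Optional;
import java.util.Set;

/** Hypothetical helper: restarts of a level are confined to that level's dedicated nodes. */
final class DedicatedPlacement {
    enum Level { G, L, A, J }

    // Assumed configuration: the nodes reserved for each hierarchy level.
    private final Map<Level, List<String>> dedicated = new EnumMap<>(Level.class);
    // Nodes that are down for service or have failed.
    private final Set<String> unavailable = new HashSet<>();

    void dedicate(Level level, String... nodes) {
        dedicated.put(level, Arrays.asList(nodes));
    }

    void markUnavailable(String node) {
        unavailable.add(node);
    }

    /** Returns the first usable dedicated node for the level, or empty if none is left. */
    Optional<String> pickRestartNode(Level level) {
        for (String node : dedicated.getOrDefault(level, Collections.emptyList())) {
            if (!unavailable.contains(node)) {
                return Optional.of(node);
            }
        }
        return Optional.empty();
    }
}
```

With three dedicated L nodes, one down for service and one failed, pickRestartNode still returns the third node, which corresponds to the three-node case mentioned above.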
Failure-Recovery: Single Job Failure - FIG. 10

FIG. 10 illustrates the case of failure detection and recovery where the failure is a single 
job (J) failure. In FIG. 10, job 2 (30-DOWN) is assumed to have failed and prior to failure was 
running on node 51-Y. Agent 31-1, which was monitoring job 30-1 on node 51-X as well as job 
30-DOWN on 51-Y, detected the job 2 (30-DOWN) failure. The failure may have been caused 
by the failure of the entire node 51-Y or by any other cause. As soon as the agent 31-1 detects 
the failure, agent 31-1 immediately restarts the failed job, perhaps on node 51-Z if it is assumed 
for purposes of example that the entire node 51-Y failed. This restart is indicated by the broken 
line from node 51-Y to node 51-Z in FIG. 10. The restarted job on node 51-Z is labeled 2' (30-UP)
because it is a new instance of the old job 2. In the general case as shown in FIG. 10, the
failing node 51-Y is different from the restart node 51-Z. However, in one embodiment, a single 
host agent (A) on each node monitors all jobs since job failures are not anticipated to be due to 
failure of an entire node. In such a host agent embodiment, the host agent restarts the failing job 
2 on the same node 51-Y that the job 2 was running on prior to the failure provided the node 51- 
Y is able to receive a restarted job. 
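
A minimal Java sketch of this restart decision follows. The class RestartingAgent and its methods are hypothetical. The sketch assumes that the agent knows which nodes are alive, that it restarts a job on its original node when that node can still accept it, and that it otherwise picks the live node with the fewest jobs; the least-loaded choice is an illustrative assumption, not a rule stated for FIG. 10.

```java
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;

/** Hypothetical agent that restarts a failed job on the same node or on another live node. */
final class RestartingAgent {
    private final Map<String, String> jobToNode = new HashMap<>(); // job id -> node
    private final Set<String> liveNodes = new LinkedHashSet<>();

    void addNode(String node) { liveNodes.add(node); }
    void nodeFailed(String node) { liveNodes.remove(node); }
    void watch(String job, String node) { jobToNode.put(job, node); }
    String nodeOf(String job) { return jobToNode.get(job); }

    /** Called when a job is found to have failed; returns the node of the new instance. */
    String onJobFailure(String job) {
        String old = jobToNode.get(job);
        // Same-node restart if the node can still receive the job, else move it.
        String target = liveNodes.contains(old) ? old : leastLoadedNode(job);
        jobToNode.put(job, target);
        return target;
    }

    /** Picks the live node that currently hosts the fewest other jobs. */
    private String leastLoadedNode(String failedJob) {
        String best = null;
        long bestLoad = Long.MAX_VALUE;
        for (String node : liveNodes) {
            long load = 0;
            for (Map.Entry<String, String> e : jobToNode.entrySet()) {
                if (!e.getKey().equals(failedJob) && e.getValue().equals(node)) {
                    load++;
                }
            }
            if (load < bestLoad) {
                best = node;
                bestLoad = load;
            }
        }
        if (best == null) {
            throw new IllegalStateException("no node available");
        }
        return best;
    }
}
```

In the FIG. 10 scenario, with job 1 on node 51-X, job 2 on node 51-Y and node 51-Y failed, this sketch restarts job 2 on node 51-Z.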

The distributed resource management unit 46 of FIG. 3 (including the entire hierarchy of 
monitoring operations for agents, local coordinators, group coordinators and universal coordina- 
tors) monitors jobs at the application level. Because resource management unit 46 is only con- 
cerned about the health of a job, the cause of the failure is irrelevant and it does not matter 
whether the entire resource failed or if only an application failed. In either case, the goal of 
completing the job was not achieved. In order to prevent undesirable results from cases where a 
non-responding job is wrongfully assumed to have failed, the operation of the persistent storage
facility 46_3 effectively intervenes because it only accepts checkpoints and other data writes from
jobs that are in good standing and rejects other jobs. When multiple instances of the same job
are running, only one instance of the job is actually allowed to modify the contents of the
persistent storage facility 46_3. In effect, as soon as duplicate jobs are detected, the duplicates are
killed. The condition of more than one instance of the same job running arises, for example,
when a job is restarted based upon a wrong determination that the job failed and therefore both
the non-failed job and the restarted job are concurrently present until the duplicate is killed.
Also, it is possible for duplicate jobs to occur when there are multiple monitoring agents for a
single job. When more than one agent determines that a job has failed and both agents initiate a
restart of the job before realizing that another agent has already restarted the job, duplicate jobs
are started. For this reason, one preferred embodiment has only one agent monitoring any
particular job.
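
The gatekeeping role of the persistent storage facility can be sketched in Java as follows. The class GuardedStore and its method names are hypothetical. The sketch assumes that each job has exactly one instance identifier recorded as being in good standing and that writes from any other instance are rejected.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical store that accepts writes only from the instance in good standing. */
final class GuardedStore {
    private final Map<String, String> goodStanding = new HashMap<>(); // job id -> allowed instance id
    private final Map<String, String> checkpoints = new HashMap<>();  // job id -> last accepted data

    /** Registers (or replaces) the single instance in good standing for a job. */
    void setGoodStanding(String jobId, String instanceId) {
        goodStanding.put(jobId, instanceId);
    }

    /** Accepts the write only from the instance in good standing; others are rejected. */
    boolean writeCheckpoint(String jobId, String instanceId, String data) {
        if (!instanceId.equals(goodStanding.get(jobId))) {
            return false; // duplicate or stale instance: its write is not applied
        }
        checkpoints.put(jobId, data);
        return true;
    }

    String lastCheckpoint(String jobId) {
        return checkpoints.get(jobId);
    }
}
```

If a job is wrongly declared failed and restarted, only the instance currently registered through setGoodStanding can modify the stored state, so the other instance's writes have no effect while it is being killed.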

Failure-Recovery: Vertical Failure - FIG. 11

FIG. 11 represents a generalized vertical failure condition in a hierarchy. A vertical failure
can occur for the entire tree from the universal level to the job level or for any sub-tree of the
hierarchy. In FIG. 11, three levels of a hierarchy are shown, namely levels 90, 91 and 92. The 
procedures for fault detection and correction are preferably the same at each level and if so, the 
levels 90, 91 and 92 can represent any of the following sequences of levels: job, agent and local 
coordinator; agent, local coordinator and group coordinator; or local coordinator, group coordi- 
nator and universal coordinator. In the first case, the suicide module 93 does not necessarily exist 
(for jobs), whereas in the other cases, the suicide module is present. 

In FIG. 11, the example assumes that a vertical failure involving all of the items Q, R, S 
and T has occurred. The vertical failure can happen if an allocation of the type described in FIG. 
7 is employed. In such an allocation, a single node failure, for example node V, will cause mul- 
tiple layers of monitors and jobs running on this node to fail. There are two different procedures 
that can detect the vertical failure: 1) at each level, if there is an alive item that was watched by 
one of the now dead monitors, it will start up a new monitor and 2) if the parent of the failing 
sub-tree is alive, it can restart its child. 

For generality, the first case is shown in FIG. 11 where Item 91-IINT lost its parent 92-DOWN
and restarts it. As soon as a new instance (92-UP) of the monitor 92-DOWN is alive
again, it detects (by recovering its state from the persistent storage unit 46_3 of FIG. 3) that it was
watching a process that is no longer present, namely 91-DOWN, a peer of 91-IINT. So 91-IINT
immediately restarts the job 91-DOWN. As soon as the job 91-DOWN is running again, it also
detects two children that are missing, namely 90-DOWN including DOWN S and DOWN T. So
91-IINT restarts 90-DOWN including DOWN S and DOWN T as well.
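
The level-by-level nature of this recovery can be sketched in Java. The class CascadeRecovery is hypothetical and simplifies the description: it assumes that the list of items watched by each monitor is available from persistent storage, and it lets each restarted item trigger the restart of its own missing children, without modeling which surviving item actually issues each restart.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical model of a vertical recovery that proceeds one level at a time. */
final class CascadeRecovery {
    // Assumed persisted state: monitor -> items it was watching.
    private final Map<String, List<String>> watched = new HashMap<>();
    private final Set<String> alive = new HashSet<>();
    private final List<String> restartOrder = new ArrayList<>();

    void persistWatch(String monitor, String... items) {
        watched.put(monitor, Arrays.asList(items));
    }

    void setAlive(String item) { alive.add(item); }

    List<String> restartOrder() { return restartOrder; }

    /** Restarts an item, then recovers its watch list and restarts any missing children. */
    void restart(String item) {
        alive.add(item);
        restartOrder.add(item);
        for (String child : watched.getOrDefault(item, Collections.emptyList())) {
            if (!alive.contains(child)) {
                restart(child);
            }
        }
    }
}
```

Assuming Q is at level 92, R at level 91 and S and T at level 90, restarting Q in this sketch yields the restart order Q, R, S, T, one level after another, while items that are still alive are left untouched.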


As indicated in FIG. 11, Q', R', S' and T' are the new instances that replace Q, R, S and
T, respectively. Naturally, like in the example of FIG. 10, each of the new instances of jobs and
monitors can be started up on the same node or on a different node. In one embodiment,
workload information about each node, stored in persistent storage unit 46_3 of FIG. 3 or
otherwise available, is used in determining where to start up new jobs and where to restart jobs that
have failed. Generally, it is faster to recover from horizontal failures because it is a one-step
process. In comparison, vertical failures need to recover each level in the failed hierarchy. This
difference between horizontal and vertical failures suggests that a horizontal hierarchy such as
illustrated in FIG. 9 is preferable to the vertically integrated hierarchy depicted in FIG. 7.
However, vertical hierarchies provide a slightly better resource usage in the case of failure-free
operation and the proper arrangement therefore can be decided on a case-by-case basis, after
examining the network latencies and other factors involved.

Failure-Recovery: Horizontal Failure - FIG. 12 

FIG. 12 illustrates a generalized horizontal failure. In FIG. 12, three levels of a hierarchy
are shown, namely levels 90, 91 and 92. The procedures for fault detection and correction are 
preferably the same at each level and if so, the levels 90, 91 and 92 can represent any of the fol- 
lowing sequences of levels: job, agent and local coordinator; agent, local coordinator and group 
coordinator; or local coordinator, group coordinator and universal coordinator. In the first case, 
the suicide module 93 does not necessarily exist (for jobs), whereas in the other cases, the sui- 
cide module is present. 

In FIG. 12, for purposes of explanation it is assumed that the items R and S have failed.
This failure could happen if a setup as described in FIG. 9 is used. In such a setup, a single node
failure can cause multiple items of the same hierarchy level to fail. There are two different cases
where procedures can detect the failure: 1) at each level, if there is an alive item that was 
watched by one of the now dead monitors, it will start up a new monitor and 2) alternatively, if 
the different parents of the failing level are alive, they can restart their children. 

Both cases are shown in FIG. 12. All of the items 90-IINT lost their parents and at the
same time, item 92-IINT lost its children. Each of the items 90-IINT and the item 92-IINT
can immediately restart the dead monitors 91. As indicated in FIG. 12, R' and S' are the new
instances that replace R and S. Naturally, like in the example of FIG. 10, each of the new
instances of jobs and monitors can be started up on the same node or on a different node. In
one embodiment, workload information about each node, stored in persistent storage unit 46_3 of
FIG. 3 or otherwise available, is used in determining where to start up new jobs and where to
restart jobs that have failed.

Failure-Recovery: Conflict Situation - FIG. 13 

FIG. 13 provides a representation of a generalized conflict situation when restarting a
monitor. In FIG. 13, three levels of a hierarchy are shown, namely levels 90, 91 and 92. The
dures for fault detection and correction are preferably the same at each level and if so, the levels 
90, 91 and 92 can represent any of the following sequences of levels: job, agent and local coordi- 
nator; agent, local coordinator and group coordinator; or local coordinator, group coordinator 
and universal coordinator. In the first case, the suicide module 93 does not necessarily exist (for 
jobs), whereas in the other cases, the suicide module 93 is present. 

FIG. 13 shows a similar view to the one in FIG. 12 except that in FIG. 13 multiple items
detect a single failure and try to correct it independently. The result is that multiple monitors 
monitoring a job can possibly interfere with each other. 91-UP1, 91-UP2, and 91-UP3 are equiv- 
alent monitors of the same hierarchy level, and exactly this interference situation should be 
avoided. To achieve this goal of interference avoidance, a protocol is required, especially be- 
cause each of these monitors can potentially be located on different nodes or, in higher levels of 
the hierarchy, even possibly at remote locations around the world. 

These possibilities of interference are corrected through use of the suicide modules. The 
suicide modules announce their existence and check for heartbeats from their peers. If it turns 
out that multiple monitors are monitoring the same job, all but one monitor will commit suicide. 
In one embodiment, a uniqueness indicator, such as the unique NIC (network interface card) ID
or IP address, is used and only the monitor running on the node with the highest uniqueness
indicator stays alive, while all other equivalent monitors commit suicide. However, since it is
possible that multiple instances of the same monitor will get started on the same node so that the
node-based ID does not establish uniqueness, the unique process ID for the monitor is then used,
by way of one example, to keep only the monitor with the lowest process ID, the monitors with
higher process IDs committing suicide.
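
The tie-breaking rule just described can be sketched in Java. The class SuicideElection, the nested MonitorId type and the numeric encoding of the node indicator are hypothetical; the sketch only assumes that every monitor knows the identities of all equivalent monitors, including its own.

```java
import java.util.Collection;
import java.util.Collections;
import java.util.Comparator;

/** Hypothetical tie-break among equivalent monitors. */
final class SuicideElection {
    /** Identity of one monitor instance; both fields are illustrative stand-ins. */
    static final class MonitorId {
        final long nodeIndicator; // e.g. a numeric form of a NIC ID or IP address
        final long processId;

        MonitorId(long nodeIndicator, long processId) {
            this.nodeIndicator = nodeIndicator;
            this.processId = processId;
        }
    }

    // Higher node indicator ranks higher; on the same node, lower process ID ranks higher.
    private static final Comparator<MonitorId> RANK =
            Comparator.comparingLong((MonitorId m) -> m.nodeIndicator)
                    .thenComparing(Comparator.comparingLong((MonitorId m) -> m.processId).reversed());

    /** Returns the one monitor that stays alive; peers must not be empty. */
    static MonitorId survivor(Collection<MonitorId> peers) {
        return Collections.max(peers, RANK);
    }

    /** True if the given monitor (which must be one of the peers) has to terminate itself. */
    static boolean mustCommitSuicide(MonitorId self, Collection<MonitorId> peers) {
        return survivor(peers) != self;
    }
}
```

For monitors with (node indicator, process ID) pairs (10, 500), (20, 300) and (20, 200), the survivor under this rule is the monitor (20, 200): it runs on the node with the highest indicator and has the lowest process ID on that node.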

In order to avoid having to use suicide protocols frequently, one embodiment also uses
methods to avoid such multiple redundant monitors from being started. In one method, when a
failure is detected, every item that detects the failure applies a random back-off delay before
attempting to start a new monitor. If there is still no heartbeat message after the back-off, the
restart is triggered. In practice it was shown that the back-off is an effective method for avoiding
multiple instances of redundant monitors. In an additional method, all peers are notified via a
broadcast message that the restart of a monitor has taken place. As soon as the other peers
receive this broadcast message, they stop their restart attempts and start sending heartbeats again.
In one embodiment, different back-off times were used for the local area and the wide area
which, among other things, compensates for greater latency due to longer communication paths
and times.
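
A minimal Java sketch of the random back-off follows. The class BackoffRestart, its parameters and the uniform distribution of the delay are assumptions made for illustration; the heartbeat check and the restart action are passed in as callbacks because their actual form is not specified here.

```java
import java.util.Random;
import java.util.function.BooleanSupplier;

/** Hypothetical random back-off applied before restarting a failed monitor. */
final class BackoffRestart {
    /** Picks a delay uniformly in [0, maxDelayMillis); 0 if maxDelayMillis is 0. */
    static long randomDelayMillis(Random rng, long maxDelayMillis) {
        return (long) (rng.nextDouble() * maxDelayMillis);
    }

    /**
     * Waits a random back-off, then restarts the monitor only if no heartbeat
     * has been seen in the meantime. Returns true if this caller triggered the restart.
     */
    static boolean restartAfterBackoff(Random rng, long maxDelayMillis,
                                       BooleanSupplier heartbeatSeen, Runnable restart) {
        try {
            Thread.sleep(randomDelayMillis(rng, maxDelayMillis));
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt and give up
            return false;
        }
        if (heartbeatSeen.getAsBoolean()) {
            return false; // another peer already restarted the monitor
        }
        restart.run();
        return true;
    }
}
```

A larger maxDelayMillis could be passed for peers in the wide area than for peers in the local area, in line with the different back-off times mentioned above.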

In one implementation, the monitors initiate all heartbeat activity with the level below
them. The absence of a heartbeat from the monitor alerts the monitored level to initiate recovery
of the monitor. Duplicate monitors discover each other as they poll (heartbeat) the monitored
processes. This polling is how the election process is effected. Each monitor sends its election
value to each monitored process. In the case of the coordinator and host agent, the coordinator sends
its election value to each host agent when it polls for heartbeat. The host agent compares this
election value with the one from previous polls and keeps the "best" one. This best value is
returned in its responses to coordinator polls. A coordinator receiving a "better" election value
back from a host agent executes its suicide function since there is another coordinator running
that has a "better" election value. Once a coordinator has polled all host agents on the network, it
can be sure it is the only coordinator left running. Until a new coordinator has completed one
complete poll cycle, it behaves in a passive way. It does not perform recovery of other
components and it does not perform load balancing.
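
The election by polling can be sketched in Java as follows. The class ElectionPolling and its nested HostAgentStub are hypothetical, and the sketch assumes that a numerically larger election value is the "better" one; the description above does not define the ordering.

```java
/** Hypothetical election among coordinators carried out through heartbeat polls. */
final class ElectionPolling {
    /** Host agent side: remembers the best election value seen in any poll so far. */
    static final class HostAgentStub {
        private long best = Long.MIN_VALUE;

        /** Handles one poll and answers with the best election value seen so far. */
        long poll(long coordinatorValue) {
            best = Math.max(best, coordinatorValue); // assumption: larger means better
            return best;
        }
    }

    /**
     * Coordinator side: polls every host agent once. Returns false as soon as an
     * agent answers with a better value, in which case the caller should execute
     * its suicide function; returns true after a complete poll cycle.
     */
    static boolean survivesPollCycle(long myValue, Iterable<HostAgentStub> agents) {
        for (HostAgentStub agent : agents) {
            if (agent.poll(myValue) > myValue) {
                return false;
            }
        }
        return true; // full cycle completed: the coordinator may leave passive mode
    }
}
```

A coordinator would stay passive, performing neither recovery nor load balancing, until survivesPollCycle has returned true once.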
Market Engine - FIG. 14

The clustering system 2 of FIG. 1 is beneficially employed to implement components of 
a market engine in FIG. 14. Details of a market engine having the components of FIG. 14 are
described in the above-identified cross-referenced application entitled MARKET ENGINES
HAVING EXTENDABLE COMPONENT ARCHITECTURE. The components 71-1, 71-2, ..., 71-Co
are interconnected by connection element 67. The connection element 67 is a logical entity
that provides the necessary physical interconnection and protocol for each of the components 71.

In FIG. 14, the components 71 include, for example, a routing component 71-1, a trigger
component 71-2, a crossing component 71-3, a scripting component 71-4, a stock component
71-5, a bond component 71-6, a currency component 71-7, an options component 71-8, an
accounting component 71-9, a TI interface component 71-10, a TP interface component 71-11, a DA
interface component 71-12, a storage component 71-13, a supervisor component 71-14 and other
components 71-15, ..., 71-Co. One or more or all of the components 71 of FIG. 14 are
implemented as services in the service unit 44 of FIG. 3. In this manner, the hierarchical fault
tolerance described in the embodiments of the present invention is applied to the market engine
components of FIG. 14.

E-commerce System - FIG. 15

FIG. 15 depicts an e-commerce system that employs the fault-tolerance framework
previously described for performing e-commerce transactions. Transactions in the system of FIG. 15
are initiated, in some instances, with transaction initiators 10 in transaction units 11, including
units 11-1, ..., 11-Tu where each of the units 11 includes transaction initiators 10-1, ..., 10-TI.
Transactions are processed, in some instances, in transaction processors 12. The transaction
initiators 10 and the transaction processors 12 are collectively referred to as transaction units 7.
The transaction initiation and processing is supervised by one or more market engines 95,
designated as market engines 95-1, ..., 95-E. In some embodiments, one or more of the market engines
95 are capable of initiating and processing transactions internally, having the equivalent of
transaction initiators 10 and/or transaction processors 12 internal to the market engine, and are then
characterized as integrated market engines.

In FIG. 15, the transaction initiators 10 are, for example, users that include computers,
terminals and other equipment and software useful for persons (individuals or companies) to
electronically connect to an e-commerce system. Alternatively, the transaction initiators may be
brokers. Brokers include computers, terminals and other equipment and software useful for
persons (individuals or companies) acting as brokers for users to electronically connect to an
e-commerce system. The transaction initiators in FIG. 15 can be of the user-only type of transaction
initiator, can be of the broker-user type of transaction initiator or can be of any other type. Any
number of such transaction initiators 10 of different types can be used in an electronic system of
FIG. 15 for initiating electronic transactions. As additional examples, hierarchies of brokers,
funds, institutions and users are included, such as broker-broker, user-user, broker-broker-user-user.
A hierarchy in any depth or configuration can exist.

The market engines 95 respond to initiated transactions and supervise interaction among
the transaction initiators 10, the transaction processors 12 and the different market engines 95 to
control the routing of the initiated transactions, the processing of transactions and the
coordinating, gathering, storing and distributing of information useful for transaction supervision and
processing. In some embodiments, historical data is used in this routing process to take advantage
of statistical patterns in the processing performed in external transaction processors. Such
historical data includes execution price and depth of the market among other things.

In the FIG. 15 system, the market engines 95 are able to access and maintain information
about transactions collectively as well as about each of the individual transactions being
processed in the market engines 95. Where high reliability in transaction handling is required, the
connections among transaction units 7 and market engines 95 are redundant or are otherwise
configured to ensure high reliability and high availability using the fault-tolerance framework
previously described.

In the FIG. 15 system, connections among the transaction initiators 10, the transaction
processors 12 and the different market engines 95 are generically shown through networks 13, but it
is to be understood that such connections in networks 13 can include direct connections among
the transaction initiators 10, the transaction processors 12 and the different market engines 95.

In FIG. 15, the transaction processors 12 include one or more conventional (or
non-conventional) exchanges 24. The exchanges include, for example, conventional exchanges 24-1, ...,
24-EX, which are, for example, the New York Stock Exchange (NYSE), Chicago Mercantile
Exchange, National Association of Securities Dealers Automated Quotation System (NASDAQ),
and other similar exchanges. In the FIG. 15 embodiment, the transaction processors include the
alternative trading systems (ATS) and particularly, ATS 26-1, ..., 26-AT. The transaction
processors also include electronic communication networks (ECN) including the ECN 25-1, ...,
25-EC. Any number of other transaction processors 27 are possible in the transaction processors 12
of FIG. 15, and these are generically indicated as the other transaction processors 27-1, ...,
27-OT. Other transaction processors include Clearing Houses for example. Some of the transaction
processors 12 in FIG. 15 include data components for receiving or providing data relevant to
transactions and these data components are designated as the data components 28-1, ..., 28-DA.
Such data components typically provide information about one or more of the other transaction
processors such as the exchanges 24, ECNs 25 and the ATSs but also can provide any other type
of data such as weather data, company earnings, political and economic data and so forth. Also,
the data components may store data, provide data for quotations and otherwise act in any
capacity to serve or receive data of all types.

In FIG. 15, the functional flow of information is shown by broken lines, while physical 
connections of the transaction initiators 10, market engines 95 and transaction processors 12 are 
generally through direct connections to the network 13 as shown by solid lines. 

Local Job Manager With Single Host Agent - FIG. 16 

FIG. 16 depicts a logical view of the hierarchy of a local job manager 48, which is one 
embodiment of the job manager 48 of FIG. 3, together with the local platform 40 including the 
jobs 30 and nodes 51 on which the jobs execute. The nodes 51, including nodes 51-1, ..., 51-N
in local platform 40, are any set of all or some of the nodes 51 for the clusters 9 of FIG. 2. These
nodes 51 in FIG. 16 are implemented using suitable computational devices, such as
workstations, servers or mainframes, with single-processor or multi-processor configurations.
The nodes 51 are the resources that are assigned for executing the jobs 30 that perform the
services 44 of FIG. 3.

In FIG. 16, the jobs 30, including jobs 30-1, ..., 30-J are, for example, programs, threads,
executable code or data structures that are useful in providing data processing services 44. For
fault-tolerant operation, the jobs 30 are monitored for proper operation, execution and
termination. Each job 30 runs on one node 51 and multiple jobs 30 can run on the same node 51 so that
there can be a many-to-one mapping of jobs to nodes. In the example of FIG. 16, Job 1 runs on
Node 1, Job 2 runs on Node 2, Job 3 and Job 4 both run on Node 3 and Job J runs on Node N.
By way of one example, a job 30 may be part of a crossing engine implementation which
functions to cross buy and sell orders for financial instruments and may be allocated for a particular
symbol (such as IBM, Intel or other traded stock instruments). That is, in one embodiment, one
job may cross shares of IBM, another job may cross shares of Intel, still another job may cross
shares of Inktomi and so on. In another embodiment, a single job may cross shares of IBM,
shares of Intel, shares of Inktomi and so on.

In FIG. 16, the local job manager 48 is implemented as code modules including the
Coordinator.java, Cluster.java, JobEntry.java, Node.java, and HostAgent.java modules. In the
local job manager 48, the Coordinator.java, Cluster.java, and multiple instances of the
JobEntry.java and Node.java modules are part of the local coordinator 33. Multiple instances of
the HostAgent.java module are used in multiple instances of the HostAgent. In the example
described, the Java language is employed for the modules but any other language such as C/C++
can also be used to avoid the additional complexity introduced by the Java Virtual Machine
(JVM). In one embodiment, the fault-tolerance framework is compiled to machine code. The
implementation of the host agent is desirably kept simple (and therefore reliable) because it is
the most important component (software and hardware) in the system for achieving overall
system reliable operation.

In the example described, system startup occurs when one or more machines start
running HostAgents. To start the HostAgents, a shell script HostAgent.sh is executed. The
HostAgent.sh script is marked as a startup script in the operating system such that it executes
automatically whenever the system is rebooted. The HostAgent.sh script is also executed by
Coordinator.java whenever the coordinator detects that a HostAgent is not responding. The
HostAgent.sh script does the following:

1) Defines the hostname where the HostAgent is to run 

2) Defines the port at which the HostAgent listens to commands 

3) Sets the permissions for reading and writing to this port among others 

4) Starts MEC.Hydra.HostAgent 

The "MEC.Hydra." prefix identifies all the programs that are in a common package and
the filename that gets executed is HostAgent.class, which has HostAgent.java as its source code.

In the example described, after system startup occurs and the shell script HostAgent.sh
has executed, one or more machines start running HostAgents. If no Coordinator exists, the one 
or more HostAgents will initiate one or more Coordinators and one Coordinator will survive and 
take control. Then, the Coordinator.java module manages overall control of the cluster of nodes 
51 in platform 40. The Coordinator.java module initializes the state of the system, handles pa- 
rameters and performs other house-keeping operations. The Cluster.java module maintains the 
state of the cluster of nodes 51 in platform 40. The Coordinator.java module uses the 
Cluster.java module to poll the host agents 31 (or alternatively the nodes 51 directly) for the 
alive status of the nodes 51, tracks where jobs 30 are running and, in certain embodiments, 
moves jobs 30 when their nodes 51 become unavailable. The Node.java module maintains node
level information and an instance of Node.java is initiated for each active node 51. The 
Node.java module tracks which jobs are running on each node 51, the node status and which ser- 
vices are available on each node. The JobEntry.java module manages information about each 
job running in the cluster of nodes 51 in platform 40 and an instance of JobEntry.java is initiated 
for each job and is referenced to the corresponding Node.java instance. 
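
The bookkeeping described for the Cluster.java, Node.java and JobEntry.java modules can be illustrated with a compact Java sketch. The classes below (ClusterSketch, NodeInfo, JobInfo) are hypothetical and are not the modules reproduced in the listings of this application. They only assume one record per node, one record per job with a reference to its node, and a simple policy that moves jobs from an unavailable node to the first node that is still alive.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Hypothetical cluster state: per-node and per-job records that reference each other. */
final class ClusterSketch {
    static final class NodeInfo {
        final String name;
        boolean alive = true;
        final Set<String> jobs = new TreeSet<>(); // ids of jobs running on this node
        NodeInfo(String name) { this.name = name; }
    }

    static final class JobInfo {
        final String id;
        NodeInfo node; // node the job currently runs on
        JobInfo(String id) { this.id = id; }
    }

    private final Map<String, NodeInfo> nodes = new LinkedHashMap<>();
    private final Map<String, JobInfo> jobs = new LinkedHashMap<>();

    void addNode(String name) { nodes.put(name, new NodeInfo(name)); }

    /** Records that a job runs on a node, removing it from any previous node. */
    void startJob(String jobId, String nodeName) {
        NodeInfo node = nodes.get(nodeName);
        if (node == null || !node.alive) {
            throw new IllegalArgumentException("node not available: " + nodeName);
        }
        JobInfo job = jobs.computeIfAbsent(jobId, JobInfo::new);
        if (job.node != null) {
            job.node.jobs.remove(jobId);
        }
        job.node = node;
        node.jobs.add(jobId);
    }

    String nodeOf(String jobId) { return jobs.get(jobId).node.name; }

    /** Marks a node unavailable and moves its jobs; returns the ids of the moved jobs. */
    List<String> nodeUnavailable(String nodeName) {
        NodeInfo failed = nodes.get(nodeName);
        if (failed == null) {
            throw new IllegalArgumentException("unknown node: " + nodeName);
        }
        failed.alive = false;
        List<String> moved = new ArrayList<>(failed.jobs); // copy: startJob edits failed.jobs
        for (String jobId : moved) {
            startJob(jobId, firstAliveNode().name);
        }
        return moved;
    }

    private NodeInfo firstAliveNode() {
        for (NodeInfo n : nodes.values()) {
            if (n.alive) {
                return n;
            }
        }
        throw new IllegalStateException("no alive node left");
    }
}
```

With the FIG. 16 allocation (Job 1 on Node 1, Job 2 on Node 2, Job 3 and Job 4 on Node 3), marking Node 3 unavailable in this sketch moves Job 3 and Job 4 to Node 1.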

In FIG. 16, host agents 31, including host agents 31-1, ..., 31-N, are monitored by the
local coordinator 33 and each HostAgent.java module monitors the jobs that are executing on a
corresponding node.

Examples of code modules representing one embodiment of FIG. 16 are included in the
following lists including HostAgent.sh in LIST_1, Coordinator.java in LIST_2, Cluster.java in
LIST_3, JobEntry.java in LIST_4, Node.java in LIST_5, HostAgent-1.java in LIST_6 and
HostAgent-2.java in LIST_7.

The HostAgent-1.java module is an example used where one job executes crossings for
multiple symbols and the HostAgent-2.java module is an example used where multiple jobs are
run for executing crossings of symbols. The jobs can be allocated with one job per symbol,
multiple jobs per symbol, one job for an entire service, one job for a group of symbols or with any
other configuration. When a job manages multiple symbols and such a job becomes unavailable
on an otherwise functioning node, only the failed subset of the symbols that are processed on the
functioning node need be restarted. In addition, different types of jobs can be monitored by the
same HostAgent, for example, a job for a shopping service and a job for a crossing service. The
HostAgent-1.java module and the HostAgent-2.java module can each be run separately or
together, for example, on the same node. If run together, the message terminology can be
harmonized where, for example, the sendStopMessage command calls the killJob command and the
sendStartMessage command calls the startJob command.
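
One way to picture the harmonized terminology is a thin adapter that exposes the message-style names and forwards them to the process-style commands. The following is a sketch only: the interface ProcessJobControl, the classes HarmonizedAgent and RecordingControl, and the single-argument signatures of startJob and killJob are assumptions made for illustration, since the listings do not show how the two modules are actually combined.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical view of the process-oriented agent (HostAgent-2 style).
interface ProcessJobControl {
    boolean startJob(String jobname);
    boolean killJob(String jobname);
}

// Hypothetical adapter exposing the message-oriented names of HostAgent-1
// while delegating to the process-oriented commands.
class HarmonizedAgent {
    private final ProcessJobControl delegate;

    HarmonizedAgent(ProcessJobControl delegate) {
        this.delegate = delegate;
    }

    // sendStartMessage maps onto startJob.
    boolean sendStartMessage(String jobname) {
        return delegate.startJob(jobname);
    }

    // sendStopMessage maps onto killJob.
    boolean sendStopMessage(String jobname) {
        return delegate.killJob(jobname);
    }
}

// Recording implementation used only to demonstrate the mapping.
class RecordingControl implements ProcessJobControl {
    final List<String> calls = new ArrayList<>();

    public boolean startJob(String jobname) {
        calls.add("startJob:" + jobname);
        return true;
    }

    public boolean killJob(String jobname) {
        calls.add("killJob:" + jobname);
        return true;
    }
}
```

With such an adapter, code written against the HostAgent-1 vocabulary could drive jobs that are actual processes without changing its call sites.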
LIST 1: HostAgent.sh
Copyright 2000 Market Engine Corporation

## package MEC.Hydra
##
## This script starts the HostAgent.
## It starts the RMI registry and host agent code in the background.
##

## First, start the RMI registry.

HOSTAGENT_PORT=3001
/bin/rmiregistry $HOSTAGENT_PORT &

##
## now start the HostAgent server program

PERMIT_FILE=/home/hydra/permit
HOSTNAME=`hostname`

/bin/java -Djava.security.policy=$PERMIT_FILE MEC.Hydra.HostAgent \
    $HOSTNAME \
    $HOSTAGENT_PORT &

exit 0
##
LIST 2: Coordinator.java
Copyright 2000 Market Engine Corporation

package MEC.Hydra;

import ...;

import ...;

// Example Coordinator uses Java and implements an RMI interface for communication
public class Coordinator extends UnicastRemoteObject
        implements CoordinatorRMI {
    //
    // Constructors
    //
    public Coordinator() throws RemoteException {
        super();
    }

    //
    // Private data
    //
    private static String Hostname;   // Local host of the Coordinator
    private static Cluster cluster;   // Cluster to supervise (coordinate)

    //
    // Public methods
    //
    public static String getHostName() {
        String hostname;
        try {
            InetAddress addr = InetAddress.getLocalHost();
            hostname = addr.getHostName();
        } catch (UnknownHostException err) {
            hostname = new String("localhost");
        }
        return hostname;
    }

    //
    // RMI callable methods
    //
    public String startJob(String jobname) throws RemoteException {
        String nodename;
        //System.out.println("Coordinator: startJob " + jobname);
        cluster.addJob(jobname);
        nodename = cluster.locateJob(jobname);
        //System.out.println("Coordinator: startJob: nodename " + nodename);
        return new String(nodename);
    }

    public String locateJob(String jobname) throws RemoteException {
        String nodename;
        nodename = cluster.locateJob(jobname);
        return new String(nodename);
    }

    //
    // Private methods
    //
    private static void poll() {
        while (true) {
            try {
                Thread.sleep(15 * 1000);
            } catch (InterruptedException err) {
                // Ignore
            }
            // cluster.checkJobs();
            // Nodes or jobs can be checked here every n milliseconds.
            // In this case, n = 15 x 1000 ms = 15 seconds. However, in this example implementation,
            // the poller() method in Cluster.java is used instead.
        }
    }

    public static void run(boolean demo) {
        System.out.println("Coordinator must run on Dispatcher node");
        Hostname = getHostName();
        cluster = new Cluster(Hostname);
        // Create and install a security manager
        if (System.getSecurityManager() == null) {
            System.setSecurityManager(new
                    SecurityManager());
        }

        try {
            String rminame;
            Coordinator obj = new Coordinator();
            rminame = Parameters.RmiName(Hostname, "Coordinator");
            Naming.rebind(rminame, obj);
            System.out.println("Coordinator: bind " + rminame);
        } catch (Exception err) {
            System.out.println("Coordinator err: " + err.getMessage());
            err.printStackTrace();
        }

        if (demo) {
            Node node;
            // For illustration, some nodes are added here and the 'VECN' job is assigned
            // and started on each of the nodes. In deployment, no nodes are explicitly
            // assigned but they are rather detected by the coordinator when an agent
            // is started on a node.
            node = new Node("nasdaq", cluster);
            node.addProgram("VECN");
            cluster.addNode(node);
            node = new Node("nyse", cluster);
            node.addProgram("VECN");
            cluster.addNode(node);
            node = new Node("cbo", cluster);
            node.addProgram("VECN");
            cluster.addNode(node);
        }
    }

    public static void main(String args[]) {
        run(true);
        // starts the coordinator and because demo is set to true, it will also start the
        // VECN job on 3 nodes.
    }
}
LIST 3: Cluster.java
Copyright 2000 Market Engine Corporation

package MEC.Hydra;

import ...;

import ...;

// Cluster is used by Coordinator.java to keep track of jobs and nodes
public class Cluster extends Thread {
    //
    // Private data
    //
    private String Hostname = null;
    private Hashtable nodes = null;
    private Hashtable jobs = null;
    private int nodecount = 0;
    private int jobcount = 0;

    //
    // Constructors
    //
    public Cluster(String hostname) {
        Hostname = hostname;
        nodes = new Hashtable();
        jobs = new Hashtable();
        start();
    }

    //
    // Node methods
    //
    // Called when the coordinator detects a new node in the cluster or when it
    // explicitly starts a new node in the demo.
    public void addNode(Node node) {
        String nodename = node.getNodeName();
        if (!nodes.containsKey(nodename)) {
            nodecount++;
            nodes.put(nodename, node);
        }
    }

    // called if the coordinator detects the failure of an entire node or when a node is
    // removed for servicing
    public void removeNode(String nodename) {
        if (nodes.containsKey(nodename)) {
            nodes.remove(nodename);
            nodecount--;
        }
    }

    public Node getNode(String nodename) {
        if (nodes.containsKey(nodename)) {
            return (Node) nodes.get(nodename);
        } else {
            return null;
        }
    }

    //
    // Job methods
    // called whenever the coordinator starts (and begins monitoring) a new job
    // anywhere on the cluster.
    public void addJob(String jobname) {
        if (!jobs.containsKey(jobname)) {
            jobcount++;
            jobs.put(jobname, new JobEntry(jobname));
        }
    }

    // called when a job terminated successfully or is no longer watched by
    // the coordinator.
    public void removeJob(String jobname) {
        if (jobs.containsKey(jobname)) {
            JobEntry job = (JobEntry) jobs.get(jobname);
            stopJob(job);
            jobs.remove(jobname);
            jobcount--;
        }
    }

    public int countJobs() {
        return jobcount;
    }

    public String locateJob(String jobname) {
        String nodename = null;
        //System.out.println("Cluster: locateJob: " + jobname);
        if (jobs.containsKey(jobname)) {
            //System.out.println("Cluster: locateJob: ok");
            JobEntry job = (JobEntry) jobs.get(jobname);
            nodename = job.getNodeName();
        }
        //System.out.println("Cluster: locateJob: nodename " + nodename);
        return nodename;
    }

    //
    // private methods
    //
    // Find a running node that can accept more jobs. This version
    // finds the running node that has the fewest jobs. (For equal load distribution
    // upon startup).
    private Node findNode(String Progname) {
        Enumeration en;
        Node bestnode = null;
        for (en = nodes.elements(); en.hasMoreElements(); ) {
            Node node = (Node) en.nextElement();
            if (node.isRunning()) {
                if ((bestnode == null) ||
                    (node.countJobs() < bestnode.countJobs()))
                    bestnode = node;
            }
        }
        return bestnode;
    }

    // Find a running node that can accept more jobs. This version
    // finds the running node that has the fewest jobs.
    private Node findNode() {
        Enumeration en;
        Node bestnode = null;

        for (en = nodes.elements(); en.hasMoreElements(); ) {
            Node node = (Node) en.nextElement();
            if (node.isRunning()) {
                if ((bestnode == null) ||
                    (node.countJobs() < bestnode.countJobs()))
                    bestnode = node;
            }
        }
        return bestnode;
    }

    private void startJob(JobEntry job) {
        Node node = findNode();
        if (node != null) {
            System.out.println("Cluster: startJob: node " +
                node.getNodeName() + " job " + job.getJobName());
            node.addJob(job);
            //route(job);
        }
    }

    private void stopJob(JobEntry job) {
        Node node;
        node = job.getNode();
        if (node != null)
            node.removeJob(job);
    }

    private void failJob(JobEntry job) {
        Node node = job.getNode();
        node.failJob(job);
    }

    private void checkNodes() {
        Enumeration en;
        System.out.println("Cluster: Available Nodes:");
        for (en = nodes.elements(); en.hasMoreElements(); ) {
            Node node = (Node) en.nextElement();
            node.pollNode();
            if (node.isRunning())
                System.out.println("\tup " + node.getNodeName());
            else
                System.out.println("\tdown " + node.getNodeName());
        }
    }

    private void checkJobs() {
        Enumeration en;
        for (en = jobs.elements(); en.hasMoreElements(); ) {
            JobEntry job = (JobEntry) en.nextElement();
            if (job.isRunning()) {
                Node node = job.getNode();
                if (!node.isRunning()) {
                    failJob(job);
                }
            } else {
                String jobname = job.getJobName();
                startJob(job);
            }
        }
    }

    private void poller() {
        System.out.println("Hostname: " + Hostname);
        System.out.println("Cluster: poller");
        while (true) {
            try {
                Thread.sleep(10000);
                // In this example, nodes and jobs are polled every 10 seconds.
            } catch (InterruptedException e) {
                // Ignore
            }
            checkNodes();
            checkJobs();
        }
    }

    public void run() {
        poller();
        // Cluster.java is used by Coordinator.java as a thread. When the thread is run,
        // Cluster.java automatically starts polling the nodes and jobs on behalf of the
        // coordinator.
    }
}
LIST 4: JobEntry.java
Copyright 2000 Market Engine Corporation

package MEC.Hydra;

import ...;

import ...;

// used by Cluster.java to keep track of its jobs and the state of each job.
// These are mostly synchronized to ensure proper state transitions.
public class JobEntry {
    public static int IDLE = 0;
    public static int WANTRUN = 1;
    public static int STARTING = 2;
    public static int RUNNING = 3;
    public static int WANTSTOP = 4;
    public static int STOPPED = 5;

    private Node node = null;
    private String progname = null;
    private String jobname = null;
    private boolean debug = false;

    protected int state = IDLE;

    //
    // Constructors
    //
    public JobEntry(String progname, String jobname) {
        this.progname = new String(progname);
        this.jobname = new String(jobname);
    }

    //
    // public methods
    //
    // Deprecated
    public JobEntry(String jobname) {
        this("VECN", jobname);
    }

    public void setDebug(boolean d) {
        debug = d;
    }

    public synchronized boolean isRunning() {
        return ((state != IDLE) && (state != STOPPED));
    }

    public synchronized boolean isStopped() {
        return ((state == IDLE) || (state == STOPPED));
    }

    public synchronized String getProgName() {
        return new String(progname);
    }

    public synchronized String getJobName() {
        return new String(jobname);
    }

    public synchronized String getNodeName() {
        String nodename = null;
        if (node != null) {
            nodename = node.getNodeName();
        }
        return nodename;
    }

    public synchronized Node getNode() {
        return node;
    }

    public synchronized boolean addJob(Node node) {
        if (state == IDLE) {
            if (debug)
                System.out.println("JobEntry: addJob " + jobname +
                    " on " + node.getNodeName());
            state = WANTRUN;
            this.node = node;
            return true;
        }
        System.out.println("JobEntry: addJob " + jobname +
            " on " + node.getNodeName() + " failed, state " + state);
        return false;
    }

    public synchronized boolean removeJob() {
        if (state == STOPPED) {
            if (debug)
                System.out.println("JobEntry: removeJob " + jobname +
                    " on " + node.getNodeName());
            state = IDLE;
            node = null;
            return true;
        }
        if (state == IDLE) {
            System.out.println("JobEntry: removeJob " + jobname + " failed");
        } else {
            System.out.println("JobEntry: removeJob " + jobname +
                " on " + node.getNodeName() + " failed, state " + state);
        }
        return false;
    }

    public synchronized boolean startJob() {
        if (state == WANTRUN) {
            if (debug)
                System.out.println("JobEntry: startJob " + jobname +
                    " on " + node.getNodeName());
            state = STARTING;
            return true;
        }
        if (state == IDLE) {
            System.out.println("JobEntry: startJob " + jobname + " failed");
        } else {
            System.out.println("JobEntry: startJob " + jobname +
                " on " + node.getNodeName() + " failed, state " + state);
        }
        return false;
    }

    public synchronized boolean startedJob() {
        if (state == STARTING) {
            if (debug)
                System.out.println("JobEntry: startedJob " + jobname +
                    " on " + node.getNodeName());
            state = RUNNING;
            return true;
        }
        if (state == IDLE) {
            System.out.println("JobEntry: startedJob " + jobname + " failed");
        } else {
            System.out.println("JobEntry: startedJob " + jobname +
                " on " + node.getNodeName() + " failed, state " + state);
        }
        return false;
    }

    public synchronized boolean stopJob() {
        if (state == RUNNING) {
            if (debug)
                System.out.println("JobEntry: stopJob " + jobname +
                    " on " + node.getNodeName());
            state = WANTSTOP;
            return true;
        }
        if (state == IDLE) {
            System.out.println("JobEntry: stopJob " + jobname + " failed");
        } else {
            System.out.println("JobEntry: stopJob " + jobname +
                " on " + node.getNodeName() + " failed, state " + state);
        }
        return false;
    }

    public synchronized boolean stoppedJob() {
        if (state == WANTSTOP) {
            if (debug)
                System.out.println("JobEntry: stopJob " + jobname +
                    " on " + node.getNodeName());
            state = STOPPED;
            return true;
        }
        if (state == IDLE) {
            System.out.println("JobEntry: stopJob " + jobname + " failed");
        } else {
            System.out.println("JobEntry: stopJob " + jobname +
                " on " + node.getNodeName() + " failed, state " + state);
        }
        return false;
    }

    public synchronized boolean failJob() {
        if ((state == IDLE) || (state == STOPPED)) {
            System.out.println("JobEntry: failJob " + jobname + " failed");
            return false;
        }
        if (debug)
            System.out.println("JobEntry: failJob " + jobname +
                " on " + node.getNodeName() +
                " failed, state " + state);
        state = IDLE;
        node = null;
        return true;
    }

}
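
The transitions that JobEntry enforces with its chain of if statements form a simple cycle, which can also be written as a lookup table. The following sketch assumes nothing beyond the states and checks visible in LIST 4; the names JobState and JobStateMachine and the method advanceFrom are hypothetical and are not part of the listings.

```java
import java.util.EnumMap;
import java.util.Map;

// States mirror the integer constants of JobEntry.
enum JobState { IDLE, WANTRUN, STARTING, RUNNING, WANTSTOP, STOPPED }

// Hypothetical table-driven version of the JobEntry transitions.
class JobStateMachine {
    // Normal life cycle: each state has exactly one successor.
    private static final Map<JobState, JobState> NEXT = new EnumMap<>(JobState.class);
    static {
        NEXT.put(JobState.IDLE, JobState.WANTRUN);      // addJob
        NEXT.put(JobState.WANTRUN, JobState.STARTING);  // startJob
        NEXT.put(JobState.STARTING, JobState.RUNNING);  // startedJob
        NEXT.put(JobState.RUNNING, JobState.WANTSTOP);  // stopJob
        NEXT.put(JobState.WANTSTOP, JobState.STOPPED);  // stoppedJob
        NEXT.put(JobState.STOPPED, JobState.IDLE);      // removeJob
    }

    private JobState state = JobState.IDLE;

    synchronized JobState getState() {
        return state;
    }

    // Advance only if the job is in the state the caller expects.
    synchronized boolean advanceFrom(JobState expected) {
        if (state != expected) {
            return false;
        }
        state = NEXT.get(state);
        return true;
    }

    // Failure resets any active job to IDLE; idle or stopped jobs cannot fail.
    synchronized boolean fail() {
        if (state == JobState.IDLE || state == JobState.STOPPED) {
            return false;
        }
        state = JobState.IDLE;
        return true;
    }
}
```

Because a failed job returns to IDLE, a polling pass such as checkJobs in LIST 3 sees it as not running and can place it on another node.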
LIST 5: Node.java
Copyright 2000 Market Engine Corporation

package MEC.Hydra;

import ...;

import ...;

// used by Cluster.java to keep track of its nodes and the states of these nodes.
// Some of these are synchronized to ensure proper state transitions.
public class Node {
    public static int IDLE = 0;
    public static int RUNNING = 1;
    public static int STOPPED = 2;

    private Cluster cluster;
    private String name;
    private long oldID = 1;
    private long newID = 0;
    private Hashtable jobs;
    private HashSet programs;
    private int state = IDLE;
    private int jobcount = 0;

    // Constructors
    // Create a new cluster node assigning it the name 'name'
    // and linking it back to its parent cluster.
    public Node(String name, Cluster cluster) {
        this.name = new String(name);
        this.cluster = cluster;
        programs = new HashSet();
        jobs = new Hashtable();
    }

    // Node methods
    public String getNodeName() {
        return name;
    }

    public boolean isRunning() {
        return (state == RUNNING);
    }

    // Program methods
    public void addProgram(String progname) {
        System.out.println("HostAgent: addProgram " + progname);
        if (!programs.contains(progname)) {
            // Add program to our list
            programs.add(progname);
            // Start up the program
            startProgram(progname);
        }
    }

    public void startProgram(String progname) {
        System.out.println("HostAgent: startProgram " + progname);
    }

    // Job methods
    public void addJob(JobEntry job) {
        if (job.addJob(this)) {
            jobs.put(job.getJobName(), job);
            jobcount++;
            startJob(job);
            System.out.println("Node " + name +
                ": addJob: jobcount " + jobcount);
        }
    }

    public void removeJob(JobEntry job) {
        if (job.isRunning())
            stopJob(job);
        jobcount--;
        jobs.remove(job.getJobName());   // jobs is keyed by job name
        System.out.println("Node " + name +
            ": removeJob: jobcount " + jobcount);
    }

    public void failJob(JobEntry job) {
        if (job.isRunning())
            job.failJob();
        jobcount--;
        jobs.remove(job.getJobName());   // jobs is keyed by job name
        System.out.println("Node " + name + ": failJob: jobcount " + jobcount);
    }

    public int countJobs() {
        return jobcount;
    }

    // Private methods
    private void startJob(JobEntry job) {
        String jobname = job.getJobName();
        String progname = job.getProgName();
        String rminame;
        tmpTimer ti = new tmpTimer(Thread.currentThread(), 5000);
        try {
            job.startJob();
            rminame = Parameters.RmiName(name, "Coordinator");
            HostAgentRMI ha = (HostAgentRMI) Naming.lookup(rminame);
            ha.startJob(progname, jobname);
            ti.cancel();
            job.startedJob();
        } catch (Exception e) {
            ti.cancel();
        }
    }

    private void stopJob(JobEntry job) {
        String jobname = job.getJobName();
        String progname = job.getProgName();
        String rminame;
        tmpTimer ti = new tmpTimer(Thread.currentThread(), 5000);
        try {
            rminame = Parameters.RmiName(name, "Coordinator");
            HostAgentRMI ha = (HostAgentRMI) Naming.lookup(rminame);
            ha.stopJob(progname, jobname);
            ti.cancel();
            job.stoppedJob();
        } catch (Exception e) {
            ti.cancel();
        }
    }

    // polls the nodes (host agents) every n milliseconds. In this case, n = 5 seconds.
    // This example uses RMI to communicate.
    public void pollNode() {
        String rminame;
        tmpTimer ti = new tmpTimer(Thread.currentThread(), 5000);
        //System.out.println("Node: pollNode: begin " + name);
        try {
            rminame = Parameters.RmiName(name, "HostAgent");
            HostAgentRMI ha = (HostAgentRMI) Naming.lookup(rminame);
            newID = ha.getInstanceID();
            ti.cancel();
            //System.out.println("Node: newID: " + newID);
            ha = null;
            state = RUNNING;
            //System.out.println("Node: pollNode: end " + name);
        } catch (Exception e) {
            ti.cancel();
            newID = 0;
            state = STOPPED;
            //System.out.println("Node: pollNode: failed " + name);
            //System.out.println("Node: pollNode: e: " + e.getMessage());
        }
    }

    public void pollJobs() {
        for (Enumeration en = jobs.elements(); en.hasMoreElements(); ) {
            JobEntry job = (JobEntry) en.nextElement();
            if (job.state == job.WANTRUN) {
                //stopJob(job);
                startJob(job);
            } else if (job.state == job.WANTSTOP) {
                stopJob(job);
            }
        }
    }
}
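
The tmpTimer class used in pollNode, startJob and stopJob is not reproduced in these listings; from its use it appears to bound how long a remote call may block (5000 milliseconds here). A comparable effect can be sketched with the standard java.util.concurrent classes. In the sketch below the class name BoundedPoller is hypothetical, and the Callable named probe stands in for the RMI call ha.getInstanceID(); this is an illustrative alternative, not the mechanism of the listings.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;

// Hypothetical poller: runs a liveness probe with an upper bound on its duration.
class BoundedPoller {
    // Daemon threads so that a hung probe cannot keep the JVM alive.
    private final ExecutorService pool = Executors.newCachedThreadPool(r -> {
        Thread t = new Thread(r);
        t.setDaemon(true);
        return t;
    });

    // Returns the instance ID reported by the probe, or 0 if the probe
    // throws or does not answer within timeoutMillis (node treated as down).
    long poll(Callable<Long> probe, long timeoutMillis) {
        Future<Long> result = pool.submit(probe);
        try {
            return result.get(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            result.cancel(true);
            return 0L;
        } catch (Exception e) {
            // ExecutionException (probe failed) or TimeoutException (probe hung)
            result.cancel(true);
            return 0L;
        }
    }

    void shutdown() {
        pool.shutdownNow();
    }
}
```

A caller could map a return value of 0 to the STOPPED state and any other value to RUNNING, mirroring how pollNode sets newID and state.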
LIST 6: HostAgent-1.java
Copyright 2000 Market Engine Corporation

package MEC.Hydra;

import ...;

import ...;

// In this example, one host agent is run on each node and a node is only 'active' to the
// cluster if the host agent is in good health.
public class HostAgent extends UnicastRemoteObject implements HostAgentRMI {
    //
    // Constructors
    //
    public HostAgent() throws RemoteException {
        super(); // redundant
        // Run RMI registry service internally
        reg = LocateRegistry.createRegistry(Parameters.RMI_PORT);
        jobs = new HashSet();
        sender = new Sender("localhost", 2000);
    }

    //
    // Private data
    //
    private Registry reg = null;            // Registry
    private static String Hostname;         // Local hostname
    private static String coordinatorHost;  // Coordinator's hostname
    private static long instanceID;         // Unique ID for this HostAgent
    private HashSet jobs;                   // List of running jobs (symbols)
    private Sender sender;                  // Communication node for VeCN

    //
    // Public methods
    //
    public static String getHostName() {
        String hostname;
        try {
            InetAddress addr = InetAddress.getLocalHost();
            hostname = addr.getHostName();
        } catch (UnknownHostException err) {
            hostname = new String("localhost");
        }
        return hostname;
    }

    // Return this HostAgent's unique instance identifier. This value
    // is different each time the HostAgent is started and is used
    // by the Coordinator to check if the HostAgent is alive. At the
    // same time, the HostAgent checks if the Coordinator has asked it
    // for its ID in the past n milliseconds and restarts the
    // Coordinator if it has not. So checking goes both ways.
    public static long previous_call;

    public long getInstanceID() {
        previous_call = System.currentTimeMillis();
        return instanceID;
    }

    // This method is called periodically in order to restart the
    // Coordinator if it is not running anymore. In the case that
    // multiple instances of the Coordinator get started, they
    // detect each other and all but the Coordinator with the smallest
    // IP address and process id terminate themselves.
    public void checkCoordinator() {
        if (System.currentTimeMillis() - 5000 > previous_call)
        {
            ...
            // restart coordinator because we haven't heard
            // from it in over 5 seconds!
        }
    }

    public void setCoordinatorHost(String host) {
        coordinatorHost = host;
    }

    public String getCoordinatorHost() {
        return coordinatorHost;
    }

    // Start a job on the local nodecontroller (VeCN) - in this example, the VECN job is
    // assumed to run and starting a new job means sending a message to the VECN and
    // telling it to add the processing of a new symbol. Alternatively, actual processes
    // could be started as illustrated in HostAgent-2.java.
    public boolean startJob(String progname, String jobname) {
        String fullname = progname + "/" + jobname;
        // Confirm job is not running here
        if (!jobs.contains(fullname)) {
            // Add job to our list
            jobs.add(fullname);
            // Tell NodeController (VeCN) to start running job (symbol)
            sendStopMessage(jobname);
            sendStartMessage(jobname);
            System.out.println("HostAgent: started " + fullname);
        }
        return true;
    }

    // Start a job on the local node controller (VeCN). In this example, the VeCN job
    // is assumed to run and starting a new job means sending a message to the VeCN,
    // signaling it to add a new symbol. Alternatively, actual processes can be
    // started, as illustrated in HostAgent-2.java.
    public boolean startJob(String jobname) {
        return startJob("VECN", jobname);
    }

    // Stop a job (symbol) running on the local nodecontroller (VeCN)
    public boolean stopJob(String progname, String jobname) {
        String fullname = progname + "/" + jobname;
        // Is job running here?
        if (jobs.contains(fullname)) {
            // Remove job from our list
            jobs.remove(fullname);
            // Tell local NodeController (VeCN) to stop running job (symbol)
            sendStopMessage(jobname);
            System.out.println("HostAgent: stopping " + fullname);
        }
        return true;
    }

    // Stop a job (symbol) running on the local nodecontroller (VeCN)
    public boolean stopJob(String jobname) {
        return stopJob("VECN", jobname);
    }

    // Stop all jobs running
    public void stopAllJobs() {
        String jobname;
        // Go through list of jobs and stop each one
        for (Iterator it = jobs.iterator(); it.hasNext(); ) {
            jobname = (String) it.next();
            stopJob(jobname);
        }
    }

    public boolean checkJob(String jobname) {
        if (jobs.contains(jobname)) {
            return true;
        } else {
            return false;
        }
    }

    //
    // Private methods
    //

    // Send a message to local NodeController (VeCN) to start job (symbol)
    private void sendStartMessage(String jobname) {
        CUSIP cusip = new CUSIP(jobname);
        ActivateSymbolMessage message = new ActivateSymbolMessage(cusip);
        if (sender.SendMessage(message, true)) {
            sender.FlushBuffers();
        } else {
            System.out.println("HostAgent: sender failed");
        }
    }

    // Send a message to local NodeController (VeCN) to stop job (symbol)
    private void sendStopMessage(String jobname) {
        CUSIP cusip = new CUSIP(jobname);
        DeactivateSymbolMessage message = new DeactivateSymbolMessage(cusip);
        if (sender.SendMessage(message, true)) {
            sender.FlushBuffers();
        } else {
            System.out.println("HostAgent: sender failed");
        }
    }

    // This method starts up the communication infrastructure and assigns an ID to itself.
    // After this, it is ready to accept jobs or other commands from the coordinator.
    public static void run() {
        Hostname = getHostName();
        // Create and install a security manager
        if (System.getSecurityManager() == null) {
            System.setSecurityManager(new RMISecurityManager());
        }
        Date now = new Date();
        instanceID = now.getTime();
        // Register with RMI
        try {
            String rminame;
            HostAgent obj = new HostAgent();
            rminame = Parameters.RmiName(Hostname, "HostAgent");
            Naming.rebind(rminame, obj);
            System.out.println("HostAgent: bind " + rminame);
        } catch (Exception err) {
            System.out.println("HostAgent err: " + err.getMessage());
            err.printStackTrace();
        }
    }

    // Main - usually called when a node is started up (see shell script).
    public static void main(String args[]) {
        run();
    }
}
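
The comments on getInstanceID and checkCoordinator above describe a two-way liveness check, but the listing elides the restart logic itself. The following sketch shows one way the timing side of that check could be isolated and tested. The class name CoordinatorWatchdog, the injectable clock and the restartAction callback are assumptions made for illustration; how the Coordinator is actually restarted is not shown in the listing and is not reproduced here.

```java
import java.util.function.LongSupplier;

// Hypothetical watchdog: remembers when the coordinator last polled this
// agent and triggers a restart action when the silence lasts too long.
class CoordinatorWatchdog {
    private final LongSupplier clock;      // milliseconds, injectable for tests
    private final long maxSilenceMillis;   // 5000 in the listing
    private final Runnable restartAction;  // assumption: how the coordinator is restarted
    private final long instanceID;
    private long previousCall;

    CoordinatorWatchdog(LongSupplier clock, long maxSilenceMillis, Runnable restartAction) {
        this.clock = clock;
        this.maxSilenceMillis = maxSilenceMillis;
        this.restartAction = restartAction;
        this.instanceID = clock.getAsLong();   // start time doubles as instance ID
        this.previousCall = this.instanceID;
    }

    // Called by the coordinator's poll; records that the coordinator is alive.
    synchronized long getInstanceID() {
        previousCall = clock.getAsLong();
        return instanceID;
    }

    // Called periodically by the agent; returns true if a restart was triggered.
    synchronized boolean checkCoordinator() {
        long now = clock.getAsLong();
        if (now - maxSilenceMillis > previousCall) {
            restartAction.run();
            previousCall = now;   // avoid restarting again on the very next check
            return true;
        }
        return false;
    }
}
```

In production use the clock would be System::currentTimeMillis; passing it in as a parameter lets the timeout be exercised without waiting in real time.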
LIST 7: HostAgent-2.java
Copyright 2000 Market Engine Corporation

package MEC.Hydra;

import ...;

import ...;

// This is a second version of the HostAgent that can easily be used in conjunction with
// the first version to get an extended implementation.
/**
 * JOB THREAD CLASS
 *
 * This class is used to fork threads for jobs that the HostAgent
 * needs to keep track of.
 * Given the name of the java program to be called, it will invoke
 * the main program for that class.
 */
class JobThread implements Runnable {
    private int MAXARGS = 256;
    private String prog;
    private String[] args;
    private int numArgs;

    public JobThread(String cmd) {
        args = new String[MAXARGS];
        parseCommand(cmd);
    }

    /** parseCommand -
     * code to parse the program and arguments from
     * the command string. This could also be done with built-in functions.
     */
    public void parseCommand(String cmd) {
        int space1 = 0;
        int space2 = 0;
        space2 = cmd.indexOf(" ", space1);
        if (space2 == -1) {
            space2 = cmd.length();
        }
        prog = cmd.substring(space1, space2);
        numArgs = 0;
        while (space2 < cmd.length()) {
            space1 = space2 + 1;
            space2 = cmd.indexOf(" ", space1);
            if (space2 == -1) {
                space2 = cmd.length();
            }
            args[numArgs] = cmd.substring(space1, space2);
            numArgs++;
            if (numArgs >= MAXARGS) {
                System.err.println("Not enough space in args struct");
                return;
            }
        }
        // System.err.println("parseCommand: prog is " + prog + " numArgs is " +
        //     numArgs);
        // for (int i = 0; i < numArgs; i++) {
        //     System.err.println("arg " + i + " is " + args[i]);
        // }
    }

    /** run -
     * call the jobThread class's main with specified command-line
     * arguments.
     */
    public void run() {
        // System.out.println("Starting job thread");
        Class paramTypes[] = {(new String[0]).getClass()};
        try {
            Class which = Class.forName(prog);
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67 Object obj = which.newlnstance(); 

68 java.lang.reflect.Method method = which.getMethod("main", paramTypes); 

69 Obj ect[] invoke Args = new Obj ect[ 1 ] ; 

70 invokeArgs[0] = args; 

71 method.invoke(obj 5 invokeArgs); 

72 } 

73 catch (Exception err) { 

74 System.err.println(err.getMessage()); 

75 err.printStackTrace(); 

76 } 

77 } 

fi 

m } 



/*
 * CHECK STATUS DAEMON CLASS
 */
class CheckStatusThread implements Runnable {
    private HostAgent agent;

    CheckStatusThread(HostAgent ha) {
        agent = ha;
    }

    public void run() {
        System.out.println("Starting check status thread");
        // HostAgent.checkThreads(agent);
        HostAgent.checkProcs(agent);
    }
}



/*
 * HOST AGENT CLASS
 *
 * The host agent class will keep alive all jobs that it starts.
 * Started jobs are assumed to want to run forever -- there is no
 * "normal" exit state, so all started jobs are kept alive until
 * killJob is called in this version.
 */
class HostAgent extends UnicastRemoteObject implements HostAgent_Interface {
    private Runtime runtime;
    private String HostName;
    private String ProcessID;
    private final int MAX_WAIT_SECONDS = 60;
    private final String COORD_PROGRAM = "/bin/coordinator";
    // private Coordinator coordinator;
    private Process CoordProcess;    // only set if Coordinator runs locally
    private Process FrontEndProcess; // only set if FrontEnd runs locally


    /** HostAgent constructor -
     * Called during system boot process or by Coordinator after HostAgent
     * crash.
     */
    public HostAgent(String hostname) throws java.rmi.RemoteException {
        super();
        ProcessID = new String("processID"); // XXX get process id
        runtime = Runtime.getRuntime();
        HostName = hostname;
        // if (findCoordinator() == false) {
        //     startCoordinator();
        // }
    }

    /** pingHostAgent -
     * essentially a ping for the HostAgent process.
     */
    public boolean pingHostAgent() throws java.rmi.RemoteException {
        return true;
    }


    private void startProcess(String jobName, String cmd) {
        Process proc;
        try {
            // System.out.println("startProcess: starting " + cmd);
            proc = runtime.exec(cmd);
            // System.out.println("startProcess: reading output");
            // byte[] data = new byte[256];
            // proc.getInputStream().read(data);
            // System.out.write(data);
            JobList.addJob(jobName, cmd, proc);
        }
        catch (java.io.IOException err) {
            System.err.println("HostAgent.startJob exec failed");
            System.err.println(err.getMessage());
        }
    }

    private void startThread(String jobName, String cmd) {
        Thread t = new Thread(new JobThread(cmd));
        t.start();
        JobList.addJob(jobName, cmd, t);
    }

    public void startJob(String jobName, String cmd) {
        startProcess(jobName, cmd);
        // startThread(jobName, cmd);
    }


    public void killJob(String jobName) throws java.rmi.RemoteException {
        try {
            Object obj = JobList.removeJob(jobName);
            if (obj instanceof Thread) {
                Thread t = (Thread) obj;
                t.destroy();
                if (t.isAlive()) {
                    HaraKiri();
                }
            }
            // instanceof rather than a getClass() comparison: Process is abstract,
            // so the object returned by Runtime.exec is always a subclass instance.
            if (obj instanceof Process) {
                Process proc = (Process) obj;
                proc.destroy();
                try {
                    proc.exitValue();
                }
                catch (IllegalThreadStateException err) {
                    HaraKiri();
                }
            }
        }
        catch (NoSuchJob err) {
            // Don't need to do anything here.
        }
    }

    // Used to convert potential failures into full failures and to commit suicide when
    // multiple instances are started. This is usually done in instances of the
    // coordination of higher level matters.
    public void HaraKiri() {
        System.out.println("HostAgent commiting Hara Kiri");
        System.exit(1);
    }



    /* */
    static void checkThreads(HostAgent agent) {
        int sleepTime = 1000; // in milliseconds
        JobListReport[] joblist;
        Thread t;
        boolean alive;
        while (true) {
            try {
                Thread.sleep(sleepTime);
            }
            catch (java.lang.InterruptedException exp) {
                System.err.println("Thread interrupted");
            }
            joblist = JobList.getJobList();
            for (int i = 0; i < joblist.length; i++) {
                t = (Thread) joblist[i].object;
                alive = t.isAlive();
                // System.out.println("Job " + joblist[i].jobName + " isAlive: " + alive);

                if (alive == false) {
                    try {
                        System.out.println("Restarting " + joblist[i].jobName);
                        Object obj = JobList.removeJob(joblist[i].jobName);
                        agent.startJob(joblist[i].jobName, joblist[i].command);
                    }
                    catch (NoSuchJob err) {
                        // if we get here, the job has been removed between
                        // the getJobList and removeJob. Since it should
                        // only be removed here or by specifically calling
                        // killJob, a killJob must have been called, so
                        // we don't want to restart this job.
                    }
                }
            }

        }
    }



    static void checkProcs(HostAgent agent) {
        int sleepTime = 1000; // in milliseconds
        JobListReport[] joblist;
        Process proc;
        boolean alive;
        while (true) {
            try {
                Thread.sleep(sleepTime);
            }
            catch (java.lang.InterruptedException exp) {
                System.err.println("Thread interrupted");
            }
            joblist = JobList.getJobList();
            for (int i = 0; i < joblist.length; i++) {
                proc = (Process) joblist[i].object;
                alive = false;
                try {
                    // exitValue() throws if the process has not yet terminated.
                    proc.exitValue();
                }
                catch (IllegalThreadStateException err) {
                    alive = true;
                }
                // System.out.println("Job " + joblist[i].jobName + " isAlive: " + alive);
                if (alive) {
                    try {
                        byte[] data = new byte[256];
                        // Echo only the bytes actually read, not the whole buffer.
                        int n = proc.getInputStream().read(data);
                        if (n > 0) {
                            System.out.write(data, 0, n);
                        }
                    }
                    catch (IOException err) {
                        System.err.println("IO Error");
                        System.err.println(err.getMessage());
                    }
                }

                else {
                    try {
                        System.out.println("Restarting " + joblist[i].jobName);
                        Object obj = JobList.removeJob(joblist[i].jobName);
                        agent.startJob(joblist[i].jobName, joblist[i].command);
                    }
                    catch (NoSuchJob err) {
                        // if we get here, the job has been removed between
                        // the getJobList and removeJob. Since it should
                        // only be removed here or by specifically calling
                        // killJob, a killJob must have been called, so
                        // we don't want to restart this job.
                    }
                }
            }

        }
    }

    // Additional methods not used during demonstration mode are commented out here.
    /** checkStatus
     * Periodically called to check:
     * 1) whether a Coordinator is running locally
     * 2) whether a FrontEnd is running locally
     * 3) status for each process in joblist
     * 4) overall system load
     * Currently system load is represented by the number of jobs.
     */
    // private HostAgentReport checkThreadStatus() {
    //     HostAgentReport report = new HostAgentReport(joblist.length);
    //     Set jobNameSet = joblist.keySet();
    //     Iterator i = jobNameSet.iterator();
    //     int count;
    //     Process proc;
    //     Thread t;
    //     report.jobNames = new String[joblist.length];
    //     for (count = 0; count < joblist.length; count++) {
    //         report.jobNames[i] = i.next();
    //         proc = report.jobNames[i].getProcess();
    //     }
    //     // do not restart if problem, just set values to null
    //     report.localCoordinator = (CoordProcess != null);
    //     report.localFrontEnd = (FrontEndProcess != null);
    //     report.systemLoad.jobqueue = joblist.length;
    //     return report;
    // }



    // /** getHostList -
    //  */
    // private String[] getHostList() {
    //     String allHostNames = { "cbo", "nyse", "nasdaq", "pse" };
    //     return allHostNames;
    // }

    // /** findCoordinator --
    //  * Query all known hosts for Coordinator process.
    //  */
    // private boolean findCoordinator() {
    //     String allHosts[] = makeAllHostList();
    //     Remote robj;
    //     boolean foundCoordinator = false;
    //     for (int i = 0; i < allHosts.length; i++) {
    //         if (foundCoordinator == false) {
    //             try {
    //                 robj = Naming.lookup("//" + allHosts[i] + "/Coordinator");
    //                 coordinator = (Coordinator) robj;
    //                 foundCoordinator = coordinator.getCoordinator();
    //             }
    //         }
    //     }
    //     return foundCoordinator;
    // }

    /** startCoordinator --
     * 1) Wait in case another host is starting a coordinator.
     * 2) Check again for Coordinator.
     * 3) If none exists, start one on local host.
     */
    // private void startCoordinator() {
    //     InetAddress InetAddr = InetAddress.getByName("localhost");
    //     int rand = (InetAddr + ProcessID) % MAX_WAIT_SECONDS;
    //     Thread t = Thread.currentThread();
    //     try {
    //         t.sleep(rand);
    //     }
    //     catch (InterruptedException ie) {
    //     }
    //     if (findCoordinator()) {
    //         return;
    //     }
    //     else {
    //         try {
    //             coordProcess = runtime.exec(COORD_PROGRAM);
    //         }
    //     }
    //     }
    // }

    // /** jobExit -
    //  * Called by exiting processes started by this HostAgent.
    //  * Remove job from joblist.
    //  */
    // public void jobExit(String jobName) {
    //     Process proc = joblist.getProcess(jobName);
    //     joblist.removeJob(jobName);
    // }



LIST 7: HostAgent-2.java 
Copyright 2000 Market Engine Corporation 



    // public void reportStatus() {
    //     HostAgentReport report = checkStatus();
    //     try {
    //         coordinator.reportStatus(report);
    //     }
    //     catch (RemoteException err) {
    //         startCoordinator();
    //     }
    // }


    // In this example code, some jobs are started and kept alive, even over 'kill -9' commands.
    public static void main(String[] args) {
        HostAgent hostAgent;
        String hostname = null;
        String port = null;
        if (args.length != 2) {
            System.err.println("usage: hostname port");
            return;
        }
        hostname = args[0];
        port = args[1];
        System.setSecurityManager(new RMISecurityManager());
        try {
            hostAgent = new HostAgent(hostname);
            Naming.rebind("//" + hostname + ":" + port + "/HostAgent", hostAgent);
        }
        catch (Exception e) {
            System.err.println("Failed to register HostAgent");
            System.out.println(e.getMessage());
            e.printStackTrace();
            return;
        }
        System.out.println("HostAgent started on " + hostAgent.HostName);
        // start checkStatus daemon
        Thread t = new Thread(new CheckStatusThread(hostAgent));
        t.start();
        // The command string is concatenated because a Java string literal cannot span lines.
        hostAgent.startJob("vecn", "/bin/java -Xms100m -Xmx200m "
            + "MEC.NodeController.NodeController IBM INKT MSFT YHOO QCOM WCOM INTC DELL "
            + "ORCL AMZN CSCO GSTRF");
        // hostAgent.startJob("demoA", "/bin/java DemoAppA");
        // hostAgent.startJob("demoB", "/bin/java DemoAppB");
        // hostAgent.startJob("demoA", "DemoAppA");
        // hostAgent.startJob("demoB", "DemoAppB");
    }
}
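
The parseCommand comment in the JobThread class notes that the parsing "could also be done with built-in functions". The following is a minimal sketch of that alternative, assuming Java's String.split is an acceptable substitute. The class name CommandSplitter and its fields are hypothetical and do not appear in the listing. Unlike the listing, this version tolerates repeated spaces and has no fixed argument limit, so it is not a drop-in equivalent.

```java
import java.util.Arrays;

// Hypothetical helper, not part of the original listing.
final class CommandSplitter {
    final String prog;    // first token: name of the class whose main is invoked
    final String[] args;  // remaining tokens, passed as command-line arguments

    CommandSplitter(String cmd) {
        // Trim first so a leading space does not produce an empty first token,
        // then split on runs of whitespace.
        String[] tokens = cmd.trim().split("\\s+");
        prog = tokens[0];
        args = Arrays.copyOfRange(tokens, 1, tokens.length);
    }
}
```

With this shape the argument array is exactly as long as the number of parsed arguments, so no separate numArgs counter is needed.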
While the invention has been particularly shown and described with reference to one embodiment thereof, it will be understood by those skilled in the art that various changes in form and details may be made therein without departing from the scope of the invention.

CLAIMS

1. (Original) A fault tolerant computer system for executing one or more jobs on one or more nodes, comprising,
    a hierarchy of monitors for monitoring operations in the computer system including,
        one or more first monitors for monitoring first operations and, for any particular one of said first operations that fails, for restarting another instance of said particular one of said first operations,
        one or more second monitors for monitoring said first monitors and, if any particular one of said first monitors fails, for restarting another instance of said particular one of said first monitors.
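
The two-level arrangement recited in Claim 1 can be illustrated with a small in-memory model. This is only a sketch under stated assumptions: the names Restartable, Operation, Monitor and checkOnce are hypothetical, failure is simulated with a boolean flag, and a restart simply sets that flag again rather than launching a process as the HostAgent listing does.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical model of Claim 1; none of these names come from the listing.
interface Restartable {
    boolean isAlive();
    void restart();   // stands in for starting another instance
}

// A first operation (for example a job) that can fail and be restarted.
class Operation implements Restartable {
    boolean alive = true;
    int restarts = 0;
    public boolean isAlive() { return alive; }
    public void restart() { alive = true; restarts++; }
}

// A monitor watches a set of targets. It is itself Restartable so that a
// higher-level monitor can watch it in turn.
class Monitor implements Restartable {
    final List<Restartable> targets = new ArrayList<>();
    boolean alive = true;
    int restarts = 0;
    public boolean isAlive() { return alive; }
    public void restart() { alive = true; restarts++; }

    // One polling pass: restart every failed target. A failed monitor does nothing.
    void checkOnce() {
        if (!alive) return;
        for (Restartable t : targets) {
            if (!t.isAlive()) t.restart();
        }
    }
}
```

In this model a failed first monitor leaves its failed job unattended until the second monitor has restarted it, which is the gap the second level is meant to close.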

2. (Original) The system of Claim 1 wherein,
    said one or more of said second monitors are monitored by at least one of said first monitors and, if any particular one of said second monitors fails, said at least one of said first monitors restarts another instance of said particular one of said second monitors.

3. (Original) The system of Claim 2 wherein one or more of said second monitors operates to commit suicide if more than one of said another instance of said particular one of said second monitors is restarted.
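
Claim 3 describes a monitor terminating itself when duplicate instances have been restarted, which the listing implements as HaraKiri() calling System.exit. The sketch below models only the decision, not the exit. The registry list, the numeric instance id and the rule that the lowest id survives are all assumptions made for illustration; the source does not say how the surviving instance is chosen.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical model; the registry and the tie-break rule are assumptions.
class MonitorInstance {
    final String role;   // which monitor this is an instance of
    final int id;        // unique per instance
    boolean alive = true;

    MonitorInstance(String role, int id) {
        this.role = role;
        this.id = id;
    }

    // Marks this instance as terminated if a live instance of the same role
    // with a lower id exists. Stands in for the listing's HaraKiri().
    void resolveDuplicates(List<MonitorInstance> registry) {
        for (MonitorInstance other : registry) {
            if (other != this && other.alive
                    && other.role.equals(role) && other.id < id) {
                alive = false;
                return;
            }
        }
    }
}
```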


4. (Original) The system of Claim 1 wherein,
    said nodes operate to execute processes in a service unit, a communication unit and a resource management unit.

5. (Original) The system of Claim 1 wherein each of said nodes includes a computer having an operating system, wherein pluralities of nodes form clusters and wherein each cluster has a corresponding instantiation of said hierarchy of monitors for monitoring operations in the computer system.






6. (Original) The system of Claim 5 wherein each instantiation of said hierarchy of monitors includes,
    a first instantiation of said one or more first monitors for monitoring first instantiation operations and, for any particular one of said first instantiation operations that fails, for restarting another instance of said particular one of said first instantiation operations,
    a second instantiation of said one or more second monitors for monitoring said first monitors of said first instantiation and, if any particular one of said first monitors of said first instantiation fails, for restarting another instance of said particular one of said first monitors of said first instantiation.

7. (Original) The system of Claim 5 including first and second instantiations and wherein,
    said one or more of said second monitors of said second instantiation are monitored by at least one of said first monitors of said first instantiation and, if any particular one of said one or more of said second monitors of said second instantiation fails, for restarting another instance of said particular one of said one or more of said second monitors of said second instantiation.

8. (Original) The system of Claim 1 wherein,
    said second monitors maintain a record of particular ones of the first monitors that are active and corresponding active particular ones of said first operations being monitored by said particular ones of the first monitors.

9. (Original) The system of Claim 8 wherein,
    said second monitors use said record to ensure that active particular ones of said first operations monitored by a failing one of said particular ones of the first monitors that are active is monitored by a new instance of said failing one of said particular ones of the first monitors that are active.
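
The record described in Claims 8 and 9 could be kept as a map from each active first monitor to the operations it watches. The sketch below is one possible shape, not the patent's implementation: the class MonitorRecord, its method names and the use of plain strings as identifiers are all hypothetical.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical record kept by a second-level monitor (Claims 8 and 9).
class MonitorRecord {
    // active first-monitor name -> names of the operations it monitors
    private final Map<String, Set<String>> assignments = new HashMap<>();

    void assign(String monitor, String operation) {
        assignments.computeIfAbsent(monitor, k -> new HashSet<>()).add(operation);
    }

    // Called once `failed` has died and `replacement` has been started:
    // the replacement inherits every operation the failed monitor was watching.
    void reassign(String failed, String replacement) {
        Set<String> ops = assignments.remove(failed);
        if (ops != null) {
            assignments.computeIfAbsent(replacement, k -> new HashSet<>()).addAll(ops);
        }
    }

    Set<String> operationsOf(String monitor) {
        return assignments.getOrDefault(monitor, new HashSet<>());
    }
}
```

Removing the failed monitor's entry and merging it into the replacement's entry keeps each operation under exactly one monitor in the record.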
10. (Original) The system of Claim 1 wherein said hierarchy of monitors includes,
    one or more additional monitors for monitoring said first monitors or said second monitors, and, if any particular one of said first monitors or said second monitors fails, restarting another instance of said particular one of said first monitors or said second monitors.

11. (Original) The system of Claim 10 wherein said hierarchy of monitors includes,
    one or more other monitors for monitoring said first monitors, said second monitors or said additional monitors, and, if any particular one of said first monitors, said second monitors or said additional monitors fails, restarting another instance of said particular one of said first monitors, said second monitors or said additional monitors.

12. (Original) The system of Claim 1 wherein,
    said first operations are jobs running on said nodes for providing services and, for any particular one of said jobs that fails, one of said first monitors restarts another instance of said particular one of said jobs.

13. (Original) The system of Claim 12 wherein said jobs implement e-commerce transaction services.

14. (Original) The system of Claim 12 wherein said jobs implement transaction services for financial instruments.

15. (Original) The system of Claim 12 wherein said first monitors are host agents for monitoring operations of a plurality of jobs on a plurality of nodes where each job is monitored by only one of said host agents.

16. (Original) The system of Claim 12 wherein said first monitors are one or more agents operating on a first level, each of said agents for monitoring operations of jobs on nodes where each job is monitored by only one of said agents.

17. (Original) The system of Claim 12 wherein,
    said first monitors are one or more agents operating on a first level, each of said agents for monitoring operations of jobs on nodes where each job is monitored by only one of said agents, and
    said one or more second monitors includes one or more local coordinators operating on a second level where each local coordinator monitors one or more of said agents.

18. (Original) The system of Claim 12 wherein,
    said first monitors are one or more agents operating on a first level, each of said agents for monitoring operations of jobs on nodes where each job is monitored by only one of said agents, and wherein a particular one of said agents runs on a particular one of said nodes where a job monitored by said particular one of said agents runs.

19. (Original) The system of Claim 12 wherein,
    said first monitors are one or more agents operating on a first level, each of said agents for monitoring operations of jobs on nodes where each job is monitored by only one of said agents, and wherein a particular one of said agents runs on a particular one of said nodes where a job monitored by said particular one of said agents runs on other of said nodes than said particular one of said nodes.

20. (Original) The system of Claim 12 wherein,
    said first monitors are one or more agents operating on a first level, each of said agents for monitoring operations of jobs on nodes where each job is monitored by only one of said agents, and wherein a particular one of said agents runs on a particular one of said nodes where a job monitored by said particular one of said agents runs,
    said second monitors are one or more local coordinators operating on a second level, each of said local coordinators for monitoring operations of agents, and wherein a particular one of said local coordinators runs on a particular one of said nodes where an agent monitored by said particular one of said local coordinators runs.

21. (Original) The system of Claim 12 wherein,
    said first monitors are one or more agents operating on a first level, each of said agents for monitoring operations of jobs on nodes where each job is monitored by only one of said agents, and wherein a particular one of said agents runs on a particular one of said nodes where a job monitored by said particular one of said agents runs,
    said second monitors are one or more local coordinators operating on a second level, each of said local coordinators for monitoring operations of agents, and wherein a particular one of said local coordinators runs on a particular one of said nodes other than where an agent monitored by said particular one of said local coordinators runs.

22. (Original) The system of Claim 12 wherein,
    said first monitors are one or more agents operating on a first level, each of said agents for monitoring operations of jobs on nodes where each job is monitored by only one of said agents,
    said second monitors are one or more local coordinators operating on a second level, each of said local coordinators for monitoring operations of agents,
    and wherein said hierarchy of monitors includes,
    one or more third monitors for monitoring said one or more second monitors and, for any particular one of said second monitors that fails, restarting another instance of said particular one of said second monitors, and wherein a particular one of said third monitors that monitors said particular one of said second monitors runs on a different node than a node where said particular one of said second monitors runs.

23. (Original) The system of Claim 22 wherein said hierarchy of monitors includes,
    one or more fourth monitors for monitoring said one or more third monitors and, for any particular one of said third monitors that fails, restarting another instance of said particular one of said third monitors, and wherein a particular one of said fourth monitors that monitors said particular one of said third monitors runs on a different node than a node where said particular one of said third monitors runs.

24. (Original) The system of Claim 12 wherein,
    said first monitors are one or more agents operating on a first level, each of said agents for monitoring operations of jobs on nodes where each job is monitored by only one of said agents,
    said second monitors are one or more local coordinators operating on a second level, each of said local coordinators for monitoring operations of agents,
    and wherein said hierarchy of monitors includes,
    one or more third monitors for monitoring said one or more second monitors and, for any particular one of said second monitors that fails, restarting another instance of said particular one of said second monitors, and wherein a particular one of said third monitors that monitors said particular one of said second monitors runs on a node where said particular one of said second monitors runs.

25. (Original) The system of Claim 24 wherein said hierarchy of monitors includes,
    one or more fourth monitors for monitoring said one or more third monitors and, for any particular one of said third monitors that fails, restarting another instance of said particular one of said third monitors, and wherein a particular one of said fourth monitors that monitors said particular one of said third monitors runs on a node where said particular one of said third monitors runs.

26. (Original) The system of Claim 1 wherein said hierarchy of monitors includes,
    one or more third monitors for monitoring said one or more second monitors and, for any particular one of said second monitors that fails, restarting another instance of said particular one of said second monitors.

27. (Original) The system of Claim 26 wherein one or more of said second monitors operates to commit suicide if more than one of said instance of said particular one of said second monitors is restarted.

28. (Original) The system of Claim 26 wherein said one or more third monitors run on different ones of said nodes than ones of said nodes on which said second monitors run.

29. (Original) The system of Claim 26 wherein said hierarchy of monitors includes,
    one or more fourth monitors for monitoring said one or more third monitors and, for any particular one of said third monitors that fails, restarting another instance of said particular one of said third monitors.

30. (Original) The system of Claim 29 wherein said one or more fourth monitors run on different ones of said nodes than ones of said nodes on which said third monitors run.

31. (Original) The system of Claim 29 wherein said one or more fourth monitors run on ones of said nodes which are the same as ones of said nodes on which said third monitors run.

32. (Original) The system of Claim 29 wherein one or more of said third monitors operates to commit suicide if more than one of said instance of said particular one of said third monitors is restarted.

33. (Original) The system of Claim 1 having a resource management unit including a load-balancing for distributing jobs among said nodes.

34. (Original) The system of Claim 1 having a resource management unit including a persistent storage unit.

35. (Original) The system of Claim 1 having a resource management unit including an interface unit.

36. (Original) The system of Claim 1 wherein,
    each of said nodes includes a plurality of computers each having an operating system.

37. (Original) The system of Claim 1 having a plurality of clusters of said nodes, each cluster having a corresponding instantiation of said hierarchy of monitors for monitoring operations in the computer system.

38. (Original) The system of Claim 37 wherein,
    each of said clusters of nodes operates to execute processes organized into a service unit, a communication unit and a resource management unit.

39. (Original) The system of Claim 37 wherein,
    said clusters of nodes are organized into groups, each group having one or more of said clusters.




40. (Original) The system of Claim 37 wherein,
    a first one of said groups is located at a geographic location remote from a second one of said groups and said first one of said groups is connected to said second one of said groups by one or more networks.

41. (Original) The system of Claim 37 wherein,
    a first one of said groups is organized to execute on one subset of data and a second one of said groups is organized to execute on another subset of data.

42. (Original) The system of Claim 37 wherein,
    a first one of said groups is organized to execute on one subset of data and a second one of said groups is organized to provide backup for said one subset of data.

43. (Original) The system of Claim 1 wherein,
    said first operations are jobs running on said nodes for providing services,
    said first monitor senses one or more conditions that can cause any particular one of said jobs to fail whether or not said particular one of said jobs has actually failed,
    one of said first monitors terminates said particular one of said jobs and restarts another instance of said particular one of said jobs.

44. (Original) The system of Claim 43 wherein,
    said one of said first monitors that terminates said particular one of said jobs restarts said another instance of said particular one of said jobs in an environment where said one or more conditions are not present.

45. (Original) The system of Claim 43 wherein,
    said one of said conditions is a node failure and said another instance of said particular one of said jobs is started on a different non-failing node.

46. (Original) The system of Claim 43 wherein,
    said one of said conditions is a job failure and said another instance of said particular one of said jobs is started as a new instance of said job.

47. (Original) The system of Claim 46 wherein,
    said another instance of said particular one of said jobs is started as a new instance of said job on a node the same as a node on which said particular one of said jobs was running.

48. (Original) The system of Claim 46 wherein,
    said another instance of said particular one of said jobs is started as a new instance of said job on a new node different from a node on which said particular one of said jobs was running.

49. (Original) The system of Claim 1 wherein each of said nodes includes a computer and wherein new ones of said nodes are added to the system without disturbing the operations of other of said nodes in the computer system and wherein jobs are assigned dynamically to said new ones of said nodes.

50. (Original) The system of Claim 1 wherein each of said nodes includes a computer and wherein ones of said nodes are removed from the system without disturbing the operations of other of said nodes in the computer system and wherein particular jobs are reassigned dynamically to other of said nodes in the computer system.

51. (Original) The system of Claim 1 wherein each of said nodes includes a computer of one type and wherein new ones of said nodes are added to the system including upgraded computers of a different type without disturbing the operations of other of said nodes in the computer system and wherein jobs are assigned dynamically from said other of said nodes to said new ones of said nodes to provide dynamic upgrade of said system without stopping said particular jobs.

52. (Original) The system of Claim 1 wherein pluralities of nodes form clusters and wherein particular ones of said clusters are assigned for processing particular jobs at particular times and wherein other ones of said clusters are assigned for processing said particular jobs at other times.

53. (Original) The system of Claim 52 wherein said particular times and said other times are follow-the-sun times.

54. (Original) The system of Claim 1 wherein a delay time is controlled before the restart of a job.

55. (Original) The system of Claim 1 wherein a delay time is controlled before the restart of a job. An interface that allows humans to monitor the health of the system and to log statistics about uptime of each component in the system.

56. (Original) The system of Claim 1 wherein a delay time is applied before said restarting another instance of said particular one of said first operations.
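
The commented-out startCoordinator method in the HostAgent listing hints at how such a delay might be chosen: it derives a wait from the host address and process ID, bounded by MAX_WAIT_SECONDS (60 in the listing). The sketch below follows that idea under stated assumptions. The class RestartDelay and the use of String.hashCode to combine the inputs are hypothetical; only the 60-second bound is taken from the listing.

```java
// Hypothetical helper; the hashing scheme is an assumption, only the
// 60-second bound (MAX_WAIT_SECONDS) comes from the listing.
final class RestartDelay {
    static final int MAX_WAIT_SECONDS = 60;

    // Derives a per-instance delay in [0, MAX_WAIT_SECONDS) from the host name
    // and process id, so that different hosts tend to wait different amounts
    // before attempting a restart.
    static int seconds(String hostName, String processId) {
        int h = (hostName + processId).hashCode();
        // floorMod keeps the result non-negative even when the hash is negative.
        return Math.floorMod(h, MAX_WAIT_SECONDS);
    }
}
```

A caller would sleep for the returned number of seconds and then check again whether another instance has already been started before restarting one itself.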

57. (Original) The system of Claim 1 wherein in said hierarchy of monitors,
    said one or more of said second monitors are monitored by at least one of said first monitors and, if any particular one of said second monitors fails, said at least one of said first monitors, after a first delay time, restarts another instance of said particular one of said second monitors on a node other than a node on which said particular one of said second monitors failed.

58. (Original) The system of Claim 57 wherein,
    if more than one instance of said another instance of said particular one of said second monitors is restarted, all but one instance of said another instance of said particular one of said second monitors commits suicide.

1 59. (Original) The system of Claim 57 wherein said hierarchy of monitors includes,

2 one or more additional monitors for monitoring said first monitors and said second moni-

3 tors, and, if any particular one of said first monitors or said second monitors fails,

4 restarting, after a second delay time, another instance of said particular one of

5 said first monitors or said second monitors.

1 60. (Original) The system of Claim 59 wherein, 

2 if more than one instance of said another instance of said particular one of said first

3 monitors or said second monitors is restarted, all but one instance of said another 

4 instance of said particular one of said first monitors or said second monitors oper- 

5 ates to commit suicide. 
1 61. (Original) The system of Claim 58 wherein said hierarchy of monitors includes,

2 one or more other monitors for monitoring said first monitors, said second monitors and 

3 said additional monitors, and, if any particular one of said first monitors, said sec- 

4 ond monitors or said additional monitors fails, restarting, after a third delay time, 

5 another instance of said particular one of said first monitors, said second monitors 

6 or said additional monitors. 

1 62. (Original) The system of Claim 61 wherein, 

2 if more than one instance of said another instance of said particular one of said first mon- 

3 itors, said second monitors or said additional monitors is restarted, all but one in- 

4 stance of said another instance of said particular one of said first monitors, said 

5 second monitors or said additional monitors operates to commit suicide. 

1 63. (Original) The system of Claim 1 wherein,

2 said first operations are jobs running on said nodes for providing services where a partic-

3 ular first one of said jobs associated with a first customer is running on a particu-

4 lar first node and a particular second one of said jobs associated with a second

5 customer is running on said particular first node.

1 64. (Original) The system of Claim 1 wherein,

2 said first operations are jobs running on said nodes for providing services where a partic-

3 ular first one of said jobs associated with a first customer is running on a particu-

4 lar first node and a particular second one of said jobs associated with a second 

5 customer is running on a particular second node whereby said first customer job 

6 is isolated from said second customer job. 
1 65. (Original) The system of Claim 1 wherein,

2 said first operations are jobs running on said nodes for providing services where, 

3 particular first ones of said jobs are associated with a first customer with one of 

4 said particular first ones of said jobs running on a particular first node and 

5 with another one of said particular first ones of said jobs running on a 

6 particular other node; 

7 particular second ones of said jobs are associated with a second customer with 

8 one of said particular second ones of said jobs running on a particular sec- 

9 ond node and with another one of said particular second ones of said jobs 
10 running on said particular other node.

1 66. (Original) The system of Claim 1 including transaction initiators for starting said first op-

2 erations as one or more jobs to initiate a transaction in a service.

1 67. (Original) The system of Claim 1 including transaction processors for starting said first

2 operations as one or more jobs to process a transaction in a service.

1 68. (Original) The system of Claim 1 including,

2 transaction initiators for starting first ones or more of said first operations as one or more

3 first jobs on a first node to initiate a transaction in a service;

4 transaction processors for starting other ones or more of said first operations as one or

5 more other jobs on another node to process said transaction in said service.

1 69. (Original) The system of Claim 1 including, 

2 transaction initiators for starting first ones or more of said first operations as one or more 

3 first jobs on a first node to initiate a transaction in a service; 

4 transaction processors for starting other ones or more of said first operations as one or 

5 more other jobs on another node to process said transaction in said service. 
1 70. (Original) The system of Claim 1 including,

2 transaction initiators for starting first ones or more of said first operations as one or more 

3 first jobs on a first node to initiate a transaction in a service; 

4 transaction processors for starting other ones or more of said first operations as one or 

5 more other jobs on said first node to process said transaction in said service. 

1 71. (Original) In a fault tolerant computer system operating to execute one or more jobs on 

2 one or more nodes where the computer system includes a hierarchy of monitors for monitoring 

3 operations in the computer system, the method comprising, 

4 monitoring first operations with one or more first monitors and, for any particular one of 

5 said first operations that fails, restarting another instance of said particular one of 
6 said first operations,

7 monitoring said first monitors with one or more second monitors and, if any particular

8 one of said first monitors fails, restarting another instance of said particular one of

9 said first monitors.

1 72. (Original) The method of Claim 71 wherein,

2 monitoring said one or more of said second monitors with at least one of said first mon-

3 itors and, if any particular one of said second monitors fails, restarting with said

4 at least one of said first monitors another instance of said particular one of said

5 second monitors.

1 73. (Original) The method of Claim 72 wherein one or more of said second monitors operates

2 to commit suicide if more than one of said another instance of said particular one of said second 

3 monitors is restarted. 
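
The monitoring and restart behavior recited in Claims 57, 71 and 72 can be illustrated with a small simulation. The following is only a sketch under stated assumptions, not the applicant's implementation: the names Element, Monitor, tick, restart_delay and the node labels are hypothetical, the delay time is modeled as a count of monitoring passes, and "another instance" is modeled by reviving the same object on a different node.

```python
class Element:
    """A job or monitor instance placed on a node (hypothetical model)."""

    def __init__(self, name, node):
        self.name = name
        self.node = node
        self.alive = True


class Monitor(Element):
    """Watches other elements and restarts failed ones after a delay."""

    def __init__(self, name, node, restart_delay=0):
        super().__init__(name, node)
        self.watched = []                    # elements this monitor is responsible for
        self.restart_delay = restart_delay   # monitoring passes to wait before a restart
        self._waiting = {}                   # element name -> passes still to wait

    def tick(self, nodes):
        """Run one monitoring pass; return the names of elements restarted."""
        restarted = []
        if not self.alive:
            return restarted  # a failed monitor cannot monitor anything
        for elem in self.watched:
            if elem.alive:
                continue
            remaining = self._waiting.get(elem.name, self.restart_delay)
            if remaining > 0:
                # Delay time not yet elapsed: count down and check again later.
                self._waiting[elem.name] = remaining - 1
                continue
            # Delay elapsed: restart on a node other than the one that failed.
            self._waiting.pop(elem.name, None)
            elem.node = next(n for n in nodes if n != elem.node)
            elem.alive = True
            restarted.append(elem.name)
        return restarted


nodes = ["node-a", "node-b"]
job = Element("job-1", "node-a")
first = Monitor("first-monitor", "node-a", restart_delay=1)
second = Monitor("second-monitor", "node-b", restart_delay=1)
first.watched = [job, second]   # first monitor watches the job and the second monitor
second.watched = [first]        # second monitor watches the first monitor

job.alive = False               # simulate a job failure
job_log = [first.tick(nodes), first.tick(nodes)]

first.alive = False             # simulate a failure of the first monitor
monitor_log = [second.tick(nodes), second.tick(nodes)]
```

In this sketch the first pass after a failure only starts the delay and the second pass performs the restart, so each log holds an empty list followed by the name of the restarted element. The sketch assumes at least one other node is available for the restart.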
1 ABSTRACT

2 A computer system having a fault-tolerance framework in an extendable computer archi- 

3 tecture. The computer system is formed of clusters of nodes where each node includes computer 

4 hardware and operating system software for executing jobs that implement the services provided 

5 by the computer system. Jobs are distributed across the nodes under control of a hierarchical 

6 resource management unit. The resource management unit includes hierarchical monitors that 

7 monitor and control the allocation of resources. In the resource management unit, a first moni- 

8 tor, at a first level, monitors and allocates elements below the first level. A second monitor, at a 

9 second level, monitors and allocates elements at the first level. The framework is extendable 

10 from the hierarchy of the first and second levels to higher levels where monitors at higher levels 

11 each monitor lower level elements in a hierarchical tree. If a failure occurs down the hierarchy,

12 a higher level monitor restarts an element at a lower level. If a failure occurs up the hierarchy, a

13 lower level monitor restarts an element at a higher level. Each of the monitors includes termina-

14 tion code that causes an element to terminate if duplicate elements have been restarted for the

15 same job. The termination code in one embodiment includes suicide code whereby an element

16 will self-destruct when the element detects that it is an unnecessary duplicate element.
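
The termination behavior described above, in which a duplicate element removes itself, can be sketched as follows. This is an illustration under assumptions rather than the disclosed implementation: the Instance class, the shared registry list, and the rule that the live instance with the lowest instance_id survives are all hypothetical choices, since the text does not state how an element decides that it is the unnecessary duplicate.

```python
class Instance:
    """One running instance of a job or monitor (hypothetical model)."""

    def __init__(self, job, instance_id):
        self.job = job
        self.instance_id = instance_id
        self.alive = True

    def check_duplicates(self, registry):
        """Self-terminate if another live instance of the same job should survive.

        Assumed tie-break rule: the live instance with the lowest
        instance_id is kept; every other duplicate terminates itself.
        """
        if not self.alive:
            return False
        peers = [i for i in registry if i.job == self.job and i.alive]
        survivor = min(peers, key=lambda i: i.instance_id)
        if survivor is not self:
            self.alive = False  # the "suicide code" path
        return self.alive


# Two instances were restarted for the same monitor; one unrelated job runs too.
registry = [
    Instance("monitor-2", 7),
    Instance("monitor-2", 3),
    Instance("job-x", 5),
]
for inst in registry:
    inst.check_duplicates(registry)

survivors = [(i.job, i.instance_id) for i in registry if i.alive]
```

Each instance makes the decision itself from the shared view of live instances, so no central coordinator is needed to remove duplicates. Exactly one instance per job remains in this sketch.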